Document analysis architecture

ABSTRACT

Systems and methods for generation and use of document analysis architectures are disclosed. A model builder component may be utilized to receive user input data for labeling a set of documents as in class or out of class. That user input data may be utilized to train one or more classification models, which may then be utilized to predict classification of other documents. Trained models may be incorporated into a model taxonomy for searching and use by other users for document analysis purposes.

RELATED APPLICATIONS

This application claims priority to and is a continuation of U.S. patent application Ser. No. 16/897,559, filed on Jun. 10, 2020, which will issue as U.S. Pat. No. 11,373,424 on Jun. 28, 2022, the entire contents of which are incorporated herein by reference.

BACKGROUND

Determining the similarities, differences, and classification of information, such as a document, in association with other information can be valuable. However, quantifying attributes of document analysis, particularly in large corpuses of documents, is difficult. Described herein are improvements in technology and solutions to technical problems that can be used to, among other things, generate and utilize a document analysis architecture.

BRIEF DESCRIPTION OF THE DRAWINGS

The detailed description is set forth below with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical items. The systems depicted in the accompanying figures are not to scale and components within the figures may be depicted not to scale with each other.

FIG. 1 illustrates a conceptual diagram of a user interface that may receive documents and utilize a document analysis architecture to analyze the documents.

FIG. 2 illustrates a schematic diagram of an example environment for a document analysis architecture.

FIG. 3A illustrates an example user interface displaying document data and user input data for classification model building.

FIG. 3B illustrates an example user interface with functionality for displaying document data and for accepting user input.

FIG. 4 illustrates an example user interface displaying information associated with use of a classification model for determining which portions of a document set are in class.

FIG. 5 illustrates an example user interface displaying information associated with confidence values associated with a document analysis architecture.

FIG. 6 illustrates an example user interface displaying document data and indications of relationships between documents.

FIG. 7A illustrates an example user interface displaying keywords determined to be in class and keywords determined to be out of class by a classification model in a word-cloud format.

FIG. 7B illustrates an example user interface displaying keywords determined to be in class and keywords determined to be out of class by a classification model in a list format.

FIG. 8 illustrates a flow diagram of an example process for determining model accuracy.

FIG. 9 illustrates a flow diagram of an example process for generating a positive dataset of keywords and a negative dataset of keywords for classification model training.

FIG. 10 illustrates a flow diagram of an example process for generating a positive dataset of vectors and a negative dataset of vectors for classification model training.

FIG. 11 illustrates a conceptual diagram of an example process for utilizing document categorization for receiving user input data and/or for training classification models.

FIG. 12 illustrates a conceptual diagram of an example process for receiving user input data for training classification models.

FIG. 13 illustrates a flow diagram of an example process for utilizing a classification model to determine whether a given document is in class or out of class.

FIG. 14 illustrates a flow diagram of an example process for determining a labelling influence value associated with receiving user input data to retrain a classification model.

FIG. 15 illustrates a conceptual diagram of an example model taxonomy.

FIG. 16 illustrates a flow diagram of an example process for determining portions of a classification model associated with confidential information and generating a modified classification model without association to the confidential information.

FIG. 17 illustrates a conceptual diagram of an example process for presenting at least a portion of a model taxonomy based at least in part on a user query and utilizing a selected model for document analysis.

FIG. 18 illustrates a conceptual diagram of an example model taxonomy showing gaps in the taxonomy where classification models have not been trained and/or where the classification models require further training.

FIG. 19 illustrates a conceptual diagram of an example process for determining which models in a model taxonomy to present in response to a user query for utilizing a model.

FIG. 20 illustrates a flow diagram of an example process for determining whether to utilize a model from a model taxonomy or whether to request training of a new model.

FIG. 21 illustrates a flow diagram of an example process for utilizing user input data to build classification models.

FIG. 22 illustrates a flow diagram of another example process for utilizing user input data to build classification models.

FIG. 23 illustrates a flow diagram of an example process for building classification models utilizing negative datasets.

FIG. 24 illustrates a flow diagram of another example process for building classification models utilizing negative datasets.

FIG. 25 illustrates a flow diagram of an example process for utilizing classification models for determining classification of example documents.

FIG. 26 illustrates a flow diagram of another example process for utilizing classification models for determining classification of example documents.

FIG. 27 illustrates a flow diagram of an example process for building model taxonomies.

FIG. 28 illustrates a flow diagram of another example process for utilizing model taxonomies.

FIG. 29 illustrates a flow diagram of an example process for searching classification models using taxonomies.

FIG. 30 illustrates a flow diagram of another example process for searching classification models using taxonomies.

DETAILED DESCRIPTION

Systems and methods for generation and use of a document analysis architecture are disclosed. Take, for example, an entity that would find it beneficial to utilize a platform to determine which documents in a set of documents are in class for a given purpose and which of the documents in the set of documents are out of class for that given purpose. For example, an entity may desire to know which patents and/or patent applications are most relevant for patentability determinations, for infringement determinations, for asset acquisition purposes, for research and development purposes, for insurance purposes, etc. Generally, a user may search a database of such documents utilizing keyword searching. To gather a reasonable number of results that does not unduly limit the documents in those results, users may employ broad keyword searching and then review each document to determine whether each document should be considered in class or out of class for the purposes at hand. However, taking patents and patent applications as an example, the potential corpus of documents, even if looking just to patents and patent applications filed in the United States, easily numbers in the thousands if not tens of thousands or more. In light of this, a document analysis platform that is configured to intake documents, receive marginal user input to train classification models, and then to use those classification models to determine which documents in a set of documents are in class would be beneficial. Additionally, a model taxonomy that includes previously-trained classification models in a searchable and utilizable fashion may be desirable to make document classification determinations across users in a time sensitive, accurate manner.

Described herein is a document analysis platform that is configured to produce classification determinations associated with document sets, such as patents and patent applications. The platform may include a model building component and a model library component. Generally, the model building component may be utilized to build or otherwise train classification models for determining whether a given document is in class or out of class. The model library may represent a taxonomy of classification models that are related to each other in a taxonomy tree or otherwise through a subject matter hierarchy. The document analysis platform may be accessible to users via one or more user interfaces that may be configured to display information associated with the classification models and model taxonomies and to receive user input.

For example, the document analysis platform may be configured to receive data representing documents. The documents may be provided to the document analysis platform by users uploading those documents and/or by the document analysis platform fetching or otherwise retrieving those documents. While example documents as described herein may be patents and patent applications, it should be understood that the documents may be any type of documents that a classification analysis may be performed on. Additionally, while examples provided herein discuss the analysis of text data from the documents, other forms of content data may also be analyzed, such as image data, metadata, audio data, etc. Furthermore, when documents are described herein as being received and/or sent to the document analysis platform, it should be understood that such sending and/or receiving includes the sending and/or receiving of data representing those documents. The data representing the documents may be received at the document analysis platform and may be stored in a database associated with the document analysis platform.

The document analysis platform may be configured to display a user interface for presenting information associated with the documents and/or analysis of the documents. For example, the user interface may include selectable portions that, when selected, may present information associated with the model building component and/or information associated with the model taxonomy component. When the model building component is selected, the user interface may be caused to display categories associated with the documents and/or a classification analysis that is being or has been conducted. Example categories for a given analysis may include, for example, project categories such as “asphalt roofing production,” “natural materials,” “roofing service,” “tire recycle,” etc. In general, the categories may correspond to projects initiated by a user of the platform and the title of the category may represent the subject matter and/or purpose of the project. Some or all of these categories, as displayed on the user interface, may be selectable to allow a user to navigate between project categories to see information associated with those project categories. Additionally, a category window may be displayed on the user interface that may present a title of the category, a status of the classification model being utilized for determining classification of documents associated with the category, and an option to export the analysis and/or documents associated with the analysis from the user interface, a classification window, an estimated model health window, a keyword window, and/or a model application window. With respect to the category title, one or more tags that have been determined from user input and/or from the classification model may be displayed. The tags may provide additional information about the project category and/or restrictions on the classification determinations associated with the project category.
With respect to the status of the classification model, this portion of the category window may provide a user with a visualization of which stage in the model building process this project category is associated with. For example, at the outset of a project category, a classification model may not be selected or trained. In these examples, the status may indicate that no model has been selected. Once a model is selected, the user may start providing indications of which documents are in class and which documents are out of class. These indications may be utilized to train the selected model. However, depending on the amount and quality of the user indications, the output of the trained model may not be associated with a high confidence value. In these examples, the status may indicate that the model has been trained but is not yet stable. Once the confidence values increase as the model is retrained, the status may indicate that the model is stable.

The classification window may be configured to display information associated with the number of uploaded or otherwise accessed documents associated with the project category as well as the number of those documents that have been labeled as in class or out of class by a user. The classification window may also be configured with a selectable portion for allowing users to upload documents. The classification window may also include an option to view a list of the documents that have been uploaded as well as an option to start classifying the documents that have been uploaded. Additional details on the list view and the user interface for classifying documents will be described in detail elsewhere herein.

The estimated model health window may be configured to display an indication of the number of the documents that have been labeled as in class and the number of the documents that have been labeled as out of class. As described more fully herein, the user may utilize the document analysis platform to display a given document and/or portions of a given document. The user interface displaying the document may also include classifying options, which may be selectable to indicate whether the document being displayed should be labeled as “in class,” corresponding to a relevant document, or “out of class,” corresponding to an irrelevant document. Other options may include, for example, an “undo” option that may be utilized to undo labelling of a document, and a “skip” option which may be utilized when the user does not desire to provide an in or out indication for a given document. As documents are labeled, the number of labeled documents increases and that information is displayed in the estimated model health window. The estimated model health window may also be configured to display an indication of the number of the documents that were predicted to be in class by the classification model and the number of the documents that were predicted to be out of class by the classification model. As described more fully herein, the classification model may be trained utilizing the labeled documents. For example, a positive training dataset associated with the documents labeled “in” may be generated and, in examples, a negative training dataset associated with the documents labeled “out” may be generated. These datasets may be utilized to train the classification model how to identify, for the documents that have not been labeled, whether a given document is in class or out of class. This information may be displayed in the estimated model health window.

The estimated model health window may also be configured to display a score trend indicator, which may indicate a confidence value associated with utilizing an instance of the classification model to predict classification of the unlabeled documents. For example, a first set of user input indicating classification of a first set of the documents may be received and utilized to train the classification model. The classification model may be run and a confidence value associated with predictions made by the classification model may be determined. Thereafter, a second set of user input indicating classification of additional ones of the documents may be received and the classification model may be retrained utilizing the second set of user input. This retrained instance of the classification model may be run and another confidence value associated with predictions made by the retrained classification model may be determined. The score trend indicator may display the confidence values as they change from run to run and may provide an indication of whether the confidence values are increasing, remaining constant, or decreasing. Such an indication may provide a user with a gauge of the impact of a given set of user inputs for training the model and whether those inputs are improving or hindering the model's ability to predict classification for the documents at issue.
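As an illustration only, the score trend determination described above might be sketched as follows. The function name `score_trend`, the `tolerance` parameter, and the use of only the two most recent runs are assumptions made for this sketch and are not specified in this description.

```python
from typing import List


def score_trend(confidences: List[float], tolerance: float = 0.01) -> str:
    """Classify the change between the two most recent run-level confidence values.

    `tolerance` is a hypothetical dead band: changes smaller than it are
    reported as "constant".
    """
    if len(confidences) < 2:
        return "insufficient data"
    delta = confidences[-1] - confidences[-2]
    if delta > tolerance:
        return "increasing"
    if delta < -tolerance:
        return "decreasing"
    return "constant"


# Confidence values recorded after each retraining run (illustrative numbers).
history = [0.62, 0.71, 0.80]
print(score_trend(history))
```

A trend of "decreasing" after a retraining run is the kind of signal that could prompt a user to revisit the most recent set of labels.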

The estimated model health window may also be configured to display a stopping criteria indicator, which may indicate a marginal benefit of receiving additional user input for model training. For example, when a model is initially trained and run, a first number of the documents will be predicted as in and a second number of the documents will be predicted as out. When the model is retrained using additional labeling information and run, the number of documents predicted as in and the number of documents predicted as out may change. This process may continue as additional labeling information is obtained and the model is retrained. However, it may be beneficial to display for a user the stopping criteria indicator, which may indicate how the number of in and out predictions differs from run to run of the model. For example, the stopping criteria indicator may show that a last retraining and running of the model did not change the number of documents predicted as in and out, and/or that the change was slight. In these examples, the user may utilize the stopping criteria indicator to determine that additional labeling to improve the model's ability to predict classification is not warranted.
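By way of a hedged example, the stopping criteria comparison could be sketched as below. The `RunSummary` type, the `relative_change` measure, and the `min_change` threshold of 1% are all hypothetical choices for illustration; this description does not prescribe a particular formula.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    """Counts of in class and out of class predictions from one model run."""
    predicted_in: int
    predicted_out: int


def relative_change(prev: RunSummary, curr: RunSummary) -> float:
    """Share of the document set by which the in class count moved between runs."""
    total = curr.predicted_in + curr.predicted_out
    if total == 0:
        return 0.0
    return abs(curr.predicted_in - prev.predicted_in) / total


def labeling_still_useful(prev: RunSummary, curr: RunSummary,
                          min_change: float = 0.01) -> bool:
    """Hypothetical rule: more labeling is warranted only if predictions still shift."""
    return relative_change(prev, curr) > min_change


previous_run = RunSummary(predicted_in=300, predicted_out=700)
latest_run = RunSummary(predicted_in=302, predicted_out=698)
print(labeling_still_useful(previous_run, latest_run))
```

Comparing counts mirrors the text above; a stricter variant might instead track how many individual documents changed their predicted label.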

The estimated model health window may also be configured to display an option to revert the model to a previous version and an option to train the model based on new labeling information. For example, when the score trend indicates a drop in confidence from retraining a model based on a given set of user input, the option to revert may be selected and the previous version of the model may be identified as the current model. The option to revert may also, in examples, remove the labeling of documents associated with that model. The training option may be utilized to instruct the document analysis platform to retrain the model based at least in part on user input received since the model was last trained. Upon retraining the model, the user interface may be configured to enable the model application window to provide functionality for the user to select an option to run the model as trained or otherwise predict classification of the documents in the document set. When the predict option is selected, the documents that have not been labeled may be analyzed to determine whether to mark those documents as in class or out of class. When this occurs, the estimated model health window may be updated, such as by changing the number of documents predicted to be in and out, the score trend indicator, and/or the stopping criteria indicator.

In addition to the above, the model keywords window may provide a visual indication of the keywords that the model has determined to be included as in class and those keywords that the model has determined to be excluded as out of class. The presentation of these keywords may take one or more forms, such as a word cloud and/or a table. In a word cloud, the size, font, emphasis, and spacing of the keywords from each other may indicate the relative importance of a given keyword to the included and excluded groupings. For example, a keyword located in the center of the word cloud with larger, darker, more emphasized font than other keywords may be the most relevant keyword to the grouping. In the table view, keywords may be ranked and the more relevant keyword may be identified as the first keyword in the table.

The document analysis platform, as described herein, may be hosted or otherwise utilized by a system that may be connected to one or more other systems and/or devices. For example, the system may be configured to receive, over a network, the documents from a third-party system that includes a document database that stores data representing the documents. The platform may also be configured to receive, over the network, data representing the documents from one or more client devices, which may be computing devices configured to access the Internet, display information, and receive user input. The client devices may include the one or more user interfaces described herein and/or may include an application configured to instruct processors of the client devices to display user interfaces as provided by the system associated with the document analysis platform, such as via an application residing on memory of the client devices and/or via an Internet browser. The client devices may receive user input, such as user input from the user interfaces, and may provide user input data corresponding to that user input to the system associated with the document analysis platform. The system may utilize that user input data for the various operations described herein. The model building component and the model library component, as described herein, may be stored in memory of the system and may be utilized to train classification models, predict document classification, and search for models, for example.

With respect to the model builder component, the document analysis platform may include a user interface configured to display a summary of information associated with individual ones of the documents of a given document set. For example, the user interface may include portions of the documents and information associated with the documents, such as whether the document has been labeled, a prediction made in association with the document, a confidence value associated with the prediction, and/or an evaluation of the document and/or a portion of the document. In the example where the documents are patents and patent applications, the user interface may display portions of the documents such as the publication number, the title, the abstract, one or more claims, and/or a claim score. The claim score may be based at least in part on an analysis of the claims of a given patent and the claim score may be provided by way of a scale, such as a scale from 0 to 5 where 0 represents the broadest claim score and 5 represents the narrowest claim score. The user interface may also provide some information about the project category associated with the documents, such as the category title, a category progress indicating how many documents have been labeled and/or predicted to be in class and out of class, how many documents have been skipped, and/or a total number of uploaded documents. The user interface may also provide options for viewing the document summaries, such as an option to filter the summaries based at least in part on one or more of the attributes of the documents and/or the analysis of the documents. The options may also include the ability to sort the summaries based at least in part on one or more of the attributes of the documents and/or the analysis of the documents.
The options may also include the ability to remove or add columns of information to the summaries and/or the option to take an action in association with a document, such as tagging a document, removing a document, editing a document, etc. In addition, the user interface may include selectable portions associated with some or each of the document summaries that, when selected, may cause another user interface to display the full document associated with the selected portion.

The summary user interface may also provide one or more indications of relationships between documents in the document set. For example, when the documents are patents and patent applications, the summary user interface may display indications of relationships between various patents and patent applications. For example, patents and patent applications may be related as “families,” meaning that the patents and patent applications have some relationship, generally associated with priority dates. For example, a given application may be a continuation, divisional, and/or continuation-in-part application of another application. In other examples, a given application may be a foreign counterpart or a Patent Cooperation Treaty (PCT) application of another application. The summary user interface may provide an indication of such relationships, such as by grouping the documents in a family together and/or providing a visual indicator of the relationship, such as a box surrounding the summaries of the documents in a given family. In these examples, each of the documents in a given family may be predicted to be in class or out of class based at least in part on one of those documents being predicted to be in class or out of class, respectively. Additionally, when one document in a family is labeled by a user as in class or out of class, the document analysis platform may automatically label the other documents in the family accordingly.
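The automatic family labeling described above can be illustrated with a short sketch. The data shapes (a mapping from family identifier to member publication numbers, and a mapping from publication number to label) and the decision to skip families with conflicting labels are assumptions made for this example.

```python
from typing import Dict, List


def propagate_family_labels(families: Dict[str, List[str]],
                            labels: Dict[str, str]) -> Dict[str, str]:
    """Copy a user label from labeled family members to unlabeled members.

    Families whose labeled members disagree are left untouched so that a
    user can resolve the conflict (a hypothetical policy for this sketch).
    """
    result = dict(labels)
    for members in families.values():
        known = {labels[m] for m in members if m in labels}
        if len(known) == 1:
            label = next(iter(known))
            for member in members:
                result.setdefault(member, label)
    return result


# Illustrative, made-up document identifiers.
families = {"family-1": ["DOC-A", "DOC-B", "DOC-C"], "family-2": ["DOC-D", "DOC-E"]}
user_labels = {"DOC-A": "in"}
print(propagate_family_labels(families, user_labels))
```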

The full document user interface may include some of the same information as described in the summary user interface. That information may include the document title, publication number, abstract, claims, and category notes such as the number of documents marked in class and out of class, the number of documents skipped, the number of documents that have been labeled, and analysis details of the document. The analysis details may include the prediction made with respect to the document, such as whether a classification model determined that the document was in class or out of class, a confidence value associated with that determination, and a claim score associated with the claims of the document. In addition to the above, the full document user interface may provide a voting window that may allow a user to provide user input indicating whether the document should be labeled as relevant or otherwise “in class” or irrelevant or otherwise “out of class.” Additional options may include “skip” and “undo” for example. The voting window may also be utilized to present one or more of the keywords, to enable “hotkeys” or otherwise shortcut keys to allow for the user input via a keyboard or similar device as opposed to a mouse scrolling and clicking one of the options, and an option to utilize uncertainty sampling.

Once documents are labeled, such as via user input as described above, one or more processes may be performed to predict the classification of other documents in a document set. For example, the system may receive user input data indicating in class documents and out of class documents from a subset of first documents. For example, if the first documents include 1,000 documents, the user input data may indicate classification for a subset, such as 20, of those documents. The system may then utilize that user input data to train a classification model, such that the classification model is configured to determine whether a given document is more similar to those documents marked in class or more similar to those documents marked out of class. Utilizing the classification model, as trained, the system may predict the classification of the remainder of the first documents that were not labeled by the user input. Each or some of the predictions for the remainder of the documents may be associated with a confidence value indicating how confident the system is that the classification model accurately determined the classification of a given document. A threshold confidence value may be determined and the system may determine whether an overall confidence value associated with the classification model satisfies that threshold confidence value. In instances where the confidence value does not satisfy the threshold confidence value, the system may cause an indication of this determination to be displayed and may request additional user input data for retraining the classification model. In instances where the confidence value satisfies the threshold confidence value, the system may receive second documents for classification prediction. The second documents may be received based at least in part on a user uploading additional documents and/or from the system retrieving additional documents from one or more databases.
The classification model may then be utilized to predict classification of this second document set.
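The threshold check in the process above might look like the following minimal sketch. Taking the overall confidence value to be the mean of the per-document confidence values, and the threshold of 0.8, are assumptions for illustration; this description does not state how the overall value is derived.

```python
from statistics import mean
from typing import List, Tuple

# Each prediction is (predicted label, confidence value in [0, 1]).
Prediction = Tuple[str, float]


def overall_confidence(predictions: List[Prediction]) -> float:
    """Assumed aggregate: the mean of the per-document confidence values."""
    return mean(confidence for _, confidence in predictions)


def next_step(predictions: List[Prediction], threshold: float = 0.8) -> str:
    """Decide whether to accept a second document set or ask for more labels."""
    if overall_confidence(predictions) >= threshold:
        return "receive second documents"
    return "request additional user input"


first_run = [("in", 0.55), ("out", 0.60), ("out", 0.70)]
print(next_step(first_run))
```

Note that `overall_confidence` raises `statistics.StatisticsError` when given no predictions, so a caller would need to handle an empty document set separately.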

As described herein, the document analysis platform may be configured to receive user input data associated with classification of given documents. To train the classification models utilizing this user input data, the document analysis platform may perform one or more operations. In some examples, the platform may generate a positive training dataset indicating in class keywords associated with the documents marked in class by a user. For example, the platform may determine one or more keywords associated with a given document that represent the subject matter of that document. This may be performed utilizing one or more document processing techniques, such as term frequency inverse document frequency techniques, for example. The platform may also generate a negative training dataset indicating keywords from the documents marked out of class by the user input. Each of these training datasets may then be utilized to train the classification model such that the classification model is configured to determine whether a given document has keywords that are more similar to the in class keywords than to the out of class keywords. In other examples, instead of or in addition to generating training datasets based on keywords, the platform may determine a vector for a given document. The vector may be associated with a coordinate system and may represent the subject matter of the document in the form of a vector. Vectors may be generated for the documents labeled in class and for the documents labeled out of class. The classification model may be trained to determine whether a vector representation of a given document is closer to the in class vectors than to the out of class vectors in the coordinate system. Techniques to generate vectors representing documents may include vectorization techniques such as Doc2Vec, or other similar techniques.
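For illustration, keyword determination using a term frequency inverse document frequency technique could be sketched as below. This is a plain-Python sketch with an unsmoothed logarithmic weighting; the function name `tfidf_keywords`, the pre-tokenized input, and the `top_k` parameter are assumptions, and a production system might rely on an established library instead.

```python
import math
from collections import Counter
from typing import List


def tfidf_keywords(docs: List[List[str]], top_k: int = 3) -> List[List[str]]:
    """Return the top_k highest-scoring TF-IDF terms for each tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    keywords = []
    for doc in docs:
        term_counts = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        }
        # Sort by descending score, breaking ties alphabetically for stability.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        keywords.append(ranked[:top_k])
    return keywords


docs = [
    ["asphalt", "roofing", "shingle", "roofing"],
    ["tire", "recycle", "rubber", "roofing"],
    ["tire", "rubber", "tread"],
]
print(tfidf_keywords(docs, top_k=2))
```

Keywords drawn from documents labeled in class would populate the positive training dataset, and keywords from documents labeled out of class the negative training dataset.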

Additionally, or alternatively, document representations may include a method that takes a document and turns it into a vector form as a list of floating point numbers based at least in part on the document's text contents. This vector form may be called an embedding. This embedding may be used to calculate distance, and therefore similarity, between documents. These embeddings could be used in association with the classification models in addition to or in replacement of the keywords and/or vectors described above. The embeddings may be utilized to create thematic groups of documents within a set. The set of documents can be defined by some keyword, CPC, owner(s), etc., and the result may be a visual display of document groups (e.g., clusters) that share similar themes. There may be a degree of supervision in the clustering process that may allow for some human control over which documents are grouped in which clusters.
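As a hedged sketch of how embeddings may be used to compare documents, the following computes cosine similarity and assigns a document to whichever labeled group it is closer to. Cosine similarity and the centroid comparison are illustrative choices, not techniques named in this description, and the two-dimensional vectors are made up; real embeddings would have many more dimensions.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def classify(doc: Vector, in_vectors: List[Vector], out_vectors: List[Vector]) -> str:
    """Label a document by whichever group centroid it is more similar to."""
    sim_in = cosine_similarity(doc, centroid(in_vectors))
    sim_out = cosine_similarity(doc, centroid(out_vectors))
    return "in" if sim_in >= sim_out else "out"


in_class = [[1.0, 0.1], [0.9, 0.0]]
out_of_class = [[0.0, 1.0], [0.1, 0.9]]
print(classify([0.8, 0.2], in_class, out_of_class))
```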

In further examples, the classification models may utilize transfer learning. In these examples, a general-purpose model may be generated and/or received, and each specific classification model may use the general-purpose model as a starting point. Rather than having to train a classification model from scratch, the model would be fine-tuned from the general-purpose model for whatever that model has not already been trained for with respect to the specific scenario being modeled. These transfer learning techniques may include the use of ULMFiT, BERT, ELMo, and T5, among others.

In addition to the techniques for training the classification models described above, the classification models may also be trained and/or organized based at least in part on classifications of the documents. For example, when the documents are patents and patent applications, a predetermined classification system may be established for classifying the subject matter of a given document. The classification system may be determined by the platform, by one or more users, and/or by a third party. For example, patents and patent applications may be associated with a predefined classification system such as the Cooperative Patent Classification (CPC) system. The CPC system employs CPC codes that correspond to differing subject matter, as described in more detail herein. The CPC codes for a given document may be identified and the categories associated with those codes may be determined. A user interface may be presented to the user that presents the determined categories and allows a user to select which categories the user finds in class for a given purpose. The selected categories may be utilized as a feature for training the classification models. Additionally, or alternatively, the platform may determine the CPC codes for documents marked as in class and may train the classification models to compare those CPC codes with the CPC codes associated with the documents to be analyzed to determine classification.
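One simple way to compare the CPC codes of a document to be analyzed with the CPC codes of documents marked in class is a set-overlap score, sketched below. The Jaccard measure and the function name are assumptions of this sketch, and the CPC codes shown are used only as illustrative inputs.

```python
def cpc_overlap(doc_codes, in_class_codes):
    """Jaccard overlap between a document's CPC codes and in class codes."""
    doc, ref = set(doc_codes), set(in_class_codes)
    if not doc or not ref:
        return 0.0
    return len(doc & ref) / len(doc | ref)


# One shared code out of three distinct codes overall.
score = cpc_overlap(["G06F 16/35", "G06N 20/00"], ["G06F 16/35", "G06F 40/30"])
print(round(score, 3))
```

Such a score could serve as one feature among several; it is not presented here as the comparison actually performed by the platform.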

In addition to the training of classification models as described above, once the classification models are trained such that the models are determined to accurately predict classification as trained, the models may be placed in a model taxonomy. The model taxonomy may represent a taxonomy tree or otherwise a model hierarchy indicating relationships between models and/or a level of specificity associated with the models. For example, a model associated with determining whether documents are in class with respect to “computers,” may be associated with other models trained to determine whether documents are in class with respect to “processors,” “memory,” and “keyboards,” respectively. Each of these models may also be associated with other models trained to determine more specific aspects of these components, such as “microprocessors” and “processor components,” or “RAM” and “partitioned memory.” This taxonomy may be searchable and may provide functionality that allows a user to provide a search query for a model. The keywords from the search query may be utilized to identify models that may be applicable to the search query and/or to highlight “branches” of the taxonomy associated with the search query.
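The taxonomy tree and keyword search described above might be represented as in the following sketch, which returns the “branch” (path from the root) for each model whose keywords overlap the query. The `ModelNode` class, its fields, and the keyword sets are hypothetical and chosen only to mirror the “computers” example.

```python
from dataclasses import dataclass, field


@dataclass
class ModelNode:
    """A node in a hypothetical model taxonomy tree."""
    name: str
    keywords: set
    children: list = field(default_factory=list)


def search(node, query_terms, path=()):
    """Yield the branch to every node whose keywords overlap the query."""
    here = path + (node.name,)
    if node.keywords & query_terms:
        yield here
    for child in node.children:
        yield from search(child, query_terms, here)


root = ModelNode("computers", {"computer"}, [
    ModelNode("processors", {"processor", "cpu"}, [
        ModelNode("microprocessors", {"microprocessor"}),
    ]),
    ModelNode("memory", {"memory"}, [
        ModelNode("RAM", {"ram"}),
    ]),
])

for branch in search(root, {"ram"}):
    print(" > ".join(branch))
```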

A user interface may be utilized to display indications of the models identified during a model search, and the user interface may be configured to receive user input indicating selection of a given model for use in determining classification of documents. The user and/or the platform may then upload the document set to be analyzed and the selected model may be utilized to predict classification of individual ones of the documents. A user interface indicating the results of the classification determination as performed utilizing the selected model may be displayed as well as a confidence value associated with the accuracy of the model in determining classification. This may provide the user with an indication of whether the previously-trained model is sufficient for analyzing the documents at issue, or whether another model should be selected, or a new model should be trained.

The model taxonomy may also provide an indication of where models have not been trained for a given subject matter. For example, a “slot” on the model taxonomy may be left blank or may indicate that a model has not yet been trained. The slots may be determined from the predefined classification system, such as by CPC codes. This may provide a user an indication of whether selecting an already-trained model would be preferable to training a new model. The model taxonomy may also provide an indication of how closely related models are on the hierarchy. This indication may be presented by way of lines between “nodes” in the taxonomy, where the length of a line may indicate how closely the models are related to each other. In addition, when a user query for use of a model is received, the model and/or models that most closely match up with the search query may be identified. The platform may determine whether at least one of the resulting models has keywords that are sufficiently similar to the keywords in the search query. In examples where there is sufficient similarity, indicators of those models may be presented as results to the user. In examples where there is insufficient similarity, the user interface may return results indicating that no models in the model taxonomy are sufficient in light of the search query, and may request that the user perform the operations associated with training a new model.
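A minimal sketch of the sufficiency check and of empty taxonomy “slots” follows. It assumes that each slot maps to a keyword set, that an untrained slot is represented by `None`, and that similarity is the fraction of query terms found among a model's keywords; the 0.5 threshold is likewise an arbitrary assumption.

```python
def find_models(query, models, threshold=0.5):
    """Return names of trained models sufficiently similar to the query."""
    terms = set(query.lower().split())
    hits = []
    for name, keywords in models.items():
        if keywords is None:  # slot exists but no model has been trained
            continue
        score = len(terms & keywords) / len(terms) if terms else 0.0
        if score >= threshold:
            hits.append((score, name))
    hits.sort(key=lambda h: (-h[0], h[1]))
    return [name for _, name in hits]


def untrained_slots(models):
    """Return the taxonomy slots for which no model has been trained."""
    return [name for name, keywords in models.items() if keywords is None]


models = {
    "processors": {"processor", "cpu", "core"},
    "memory": {"ram", "memory", "cache"},
    "keyboards": None,
}
print(find_models("cpu core", models))
print(find_models("display panel", models))  # empty: train a new model
print(untrained_slots(models))
```

An empty result corresponds to the case in which the user interface would indicate that no existing model is sufficient and would prompt the user to train a new one.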

The present disclosure provides an overall understanding of the principles of the structure, function, manufacture, and use of the systems and methods disclosed herein. One or more examples of the present disclosure are illustrated in the accompanying drawings. Those of ordinary skill in the art will understand that the systems and methods specifically described herein and illustrated in the accompanying drawings are non-limiting embodiments. The features illustrated or described in connection with one embodiment may be combined with the features of other embodiments, including as between systems and methods. Such modifications and variations are intended to be included within the scope of the appended claims.

Additional details are described below with reference to several example embodiments.

FIG. 1 illustrates a conceptual diagram of a user interface 100 that may receive documents and utilize a document analysis platform to analyze the documents. The user interface 100 may be displayed on a display of an electronic device, such as the electronic device 202 as described with respect to FIG. 2 below.

For example, the document analysis platform may be configured to receive data representing documents 102. The documents 102 may be provided to the document analysis platform by users uploading those documents 102 and/or by the document analysis platform fetching or otherwise retrieving those documents 102. While example documents 102 as described herein may be patents and patent applications, it should be understood that the documents 102 may be any type of documents 102 that a classification analysis may be performed on. Additionally, while examples provided herein discuss the analysis of text data from the documents 102, other forms of content data may also be analyzed, such as image data, metadata, audio data, etc. The data representing the documents 102 may be received at the document analysis platform and may be stored in a database associated with the document analysis platform.

The document analysis platform may be configured to display the user interface 100 for presenting information associated with the documents 102 and/or analysis of the documents 102. For example, the user interface 100 may include selectable portions that, when selected, may present information associated with a model building component of the document analysis platform and/or information associated with a model taxonomy component of the document analysis platform. When the model building component is selected, the user interface 100 may be caused to display categories 104 associated with the documents 102 and/or a classification analysis that is being or has been conducted. Example categories 104 for a given analysis may include, for example, project categories such as “asphalt roofing production,” “natural materials,” “roofing service,” “tire recycle,” etc. As shown in FIG. 1, the example categories 104 are “Category 1,” “Category 2,” and “Category 3.” Some or all of these categories 104, as displayed on the user interface 100, may be selectable to allow a user to navigate between project categories to see information associated with those project categories. Additionally, a category window 106 may be displayed on the user interface 100 that may present a title of the category 104, a status of the classification model being utilized for determining classification of documents 102 associated with the category 104, an option to export the analysis and/or documents 102 associated with the analysis from the user interface 100, a classification window 108, an estimated model health window 110, a keyword window 112, and/or a model application window 114. With respect to the category title, one or more tags that have been determined from user input and/or from the classification model may be displayed. The tags may provide additional information about the project category 104 and/or restrictions on the classification determinations associated with the project category 104.

With respect to the status of the classification model, this portion of the category window 106 may provide a user with a visualization of which stage in the model building process this project category 104 is associated with. For example, at the outset of a project category 104, a classification model may not be selected or trained. In these examples, the status may indicate that no model has been selected. Once a model is selected, the user may start providing indications of which documents 102 are in class and which documents 102 are out of class. These indications may be utilized to train the selected model. However, depending on the amount and quality of the user indications, the output of the trained model may not be associated with a high confidence value. In these examples, the status may indicate that the model has been trained but is not yet stable. Once the confidence values increase as the model is retrained, the status may indicate that the model is stable.
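The status progression described above might be derived as in the sketch below. The status strings, the use of the mean confidence value, and the 0.9 stability threshold are assumptions made for illustration; the disclosure does not specify how stability is determined.

```python
from statistics import mean


def model_status(model_selected, confidences, stable_at=0.9):
    """Map the model building stage to a status label for display."""
    if not model_selected:
        return "no model selected"
    if not confidences:
        return "model selected, not yet trained"
    if mean(confidences) >= stable_at:
        return "stable"
    return "trained, not yet stable"


print(model_status(False, []))
print(model_status(True, [0.55, 0.62]))
print(model_status(True, [0.95, 0.97]))
```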

The classification window 108 may be configured to display information associated with the number of uploaded or otherwise accessed documents 102 associated with the project category 104 as well as the number of those documents 102 that have been labeled as in class or out of class by a user. The classification window 108 may also be configured with a selectable portion for allowing users to upload documents 102. The classification window 108 may also include an option to view a list of the documents 102 that have been uploaded as well as an option to start classifying the documents 102 that have been uploaded. Additional details on the list view and the user interface for classifying documents 102 will be described in detail elsewhere herein.

The estimated model health window 110 may be configured to display an indication of the number of the documents 102 that have been labeled as in class and the number of the documents 102 that have been labeled as out of class. As described more fully herein, the user may utilize the document analysis platform to display a given document 102 and/or portions of a given document 102. The user interface displaying the document 102 may also include classifying options, which may be selectable to indicate whether the document 102 being displayed should be labeled as “in,” corresponding to a relevant document, or “out,” corresponding to an out of class document. Other options may include, for example, an “undo” option that may be utilized to undo labelling of a document 102, and a “skip” option which may be utilized when the user does not desire to provide an in or out indication for a given document 102. As documents 102 are labeled, the number of labeled documents increases and that information is displayed in the estimated model health window 110. The estimated model health window 110 may also be configured to display an indication of the number of the documents 102 that were predicted to be in class by the classification model and the number of the documents 102 that were predicted to be out of class by the classification model. As described more fully herein, the classification model may be trained utilizing the labeled documents. For example, a positive training dataset associated with the documents labeled “in” may be generated and, in examples, a negative training dataset associated with the documents labeled “out” may be generated. These datasets may be utilized to train the classification model how to identify, for the documents 102 that have not been labeled, whether a given document 102 is in class or out of class. This information may be displayed in the estimated model health window 110.

The estimated model health window 110 may also be configured to display a score trend indicator, which may indicate a confidence value associated with utilizing an instance of the classification model to predict classification of the unlabeled documents. For example, a first set of user input indicating classification of a first set of the documents 102 may be received and utilized to train the classification model. The classification model may be run and a confidence value associated with predictions made by the classification model may be determined. Thereafter, a second set of user input indicating classification of additional ones of the documents 102 may be received and the classification model may be retrained utilizing the second set of user input. This retrained instance of the classification model may be run and another confidence value associated with predictions made by the retrained classification model may be determined. The score trend indicator may display the confidence values as they change from run to run and may provide an indication about whether the confidence values are increasing, remaining constant, or decreasing. Such an indication may provide a user with a gauge of the impact of a given set of user inputs for training the model and whether those inputs are improving or hindering the model's ability to predict classification for the documents 102 at issue.
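A score trend of this kind might be computed from the per-run confidence values as sketched below. The function name, the comparison of only the two most recent runs, and the 0.01 tolerance for treating values as constant are assumptions of this sketch.

```python
def score_trend(confidences, tolerance=0.01):
    """Classify the change between the two most recent confidence values."""
    if len(confidences) < 2:
        return "insufficient data"
    delta = confidences[-1] - confidences[-2]
    if delta > tolerance:
        return "increasing"
    if delta < -tolerance:
        return "decreasing"
    return "constant"


print(score_trend([0.71, 0.80]))
print(score_trend([0.80, 0.80]))
print(score_trend([0.80, 0.70]))
```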

The estimated model health window 110 may also be configured to display a stopping criteria indicator, which may indicate a marginal benefit of receiving additional user input for model training. For example, when a model is initially trained and run, a first number of the documents 102 will be predicted as “in” and a second number of the documents 102 will be predicted as “out.” When the model is retrained using additional labeling information and run, the number of documents 102 predicted as “in” and the number of documents 102 predicted as “out” may change. This process may continue as additional labeling information is obtained and the model is retrained. However, it may be beneficial to display for a user the stopping criteria indicator, which may indicate how the number of “in” and “out” predictions differs from run to run of the model. For example, the stopping criteria indicator may show that a last retraining and running of the model did not change the number of documents 102 labeled “in” and “out,” and/or that the change was slight. In these examples, the user may utilize the stopping criteria indicator to determine that additional labeling to improve the model's ability to predict classification is not warranted.
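One way to quantify the run-to-run change described above is to express the shift in the number of “in” predictions as a fraction of all predicted documents, as in the following sketch. The function name, the dictionary layout, and the 2% cutoff for a “slight” change are assumptions; the counts are illustrative.

```python
def stopping_indicator(prev_counts, curr_counts, max_change=0.02):
    """Return (fractional change in "in" predictions, whether to stop)."""
    total = curr_counts["in"] + curr_counts["out"]
    if total == 0:
        return 0.0, False
    change = abs(curr_counts["in"] - prev_counts["in"]) / total
    return change, change <= max_change


# A slight change between runs suggests further labeling adds little.
change, stop = stopping_indicator({"in": 929, "out": 5697}, {"in": 935, "out": 5691})
print(round(change, 4), stop)
```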

The estimated model health window 110 may also be configured to display an option to revert the model to a previous version and an option to train the model based on new labeling information. For example, when the score trend indicates a decrease in confidence from retraining a model based on a given set of user input, the option to revert may be selected and the previous version of the model may be identified as the current model. The option to revert may also, in examples, remove the labeling of documents 102 associated with that model. The training option may be utilized to instruct the document analysis platform to retrain the model based at least in part on user input received since the model was last trained. Upon retraining the model, the user interface 100 may be configured to enable the model application window 114 to provide functionality for the user to select an option to run the model as trained or otherwise predict classification of the documents 102 in the document set. When the predict option is selected, the documents 102 that have not been labeled may be analyzed to determine whether to mark those documents 102 as in class or out of class. When this occurs, the estimated model health window 110 may be updated, such as by changing the number of documents 102 predicted to be “in” and “out,” the score trend indicator, and/or the stopping criteria indicator.
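The revert and retrain options might be backed by a simple version history, sketched below under the assumption that each training round stores the model together with the labels first used in that round. The `ModelHistory` class and its methods are hypothetical; models are represented by placeholder strings.

```python
class ModelHistory:
    """Hypothetical version history pairing each model with its new labels."""

    def __init__(self):
        self._versions = []  # list of (model, labels added in that round)

    def train(self, model, new_labels):
        """Record a newly trained model and the labels used to train it."""
        self._versions.append((model, dict(new_labels)))

    @property
    def current(self):
        return self._versions[-1][0] if self._versions else None

    def labels(self):
        """All labels associated with the versions still in the history."""
        merged = {}
        for _, round_labels in self._versions:
            merged.update(round_labels)
        return merged

    def revert(self):
        """Drop the latest version and its labels; return the prior model."""
        if len(self._versions) < 2:
            raise ValueError("no previous version to revert to")
        self._versions.pop()
        return self.current


history = ModelHistory()
history.train("model-v1", {"doc1": "in"})
history.train("model-v2", {"doc2": "out"})
print(history.revert(), history.labels())
```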

In addition to the above, the keyword window 112 may provide a visual indication of the keywords that the model has determined to be included as in class and those keywords that the model has determined to be excluded as out of class. The presentation of these keywords may take one or more forms, such as a word cloud and/or a table. In a word cloud, the size, font, emphasis, and spacing of the keywords from each other may indicate the relative importance of a given keyword to the included and excluded groupings. For example, a keyword located in the center of the word cloud with larger, darker, more emphasized font than other keywords may be the most relevant keyword to the grouping. In the table view, keywords may be ranked and the most relevant keyword may be identified as the first keyword in the table.

FIG. 2 illustrates a schematic diagram of an example environment 200 for a document analysis architecture. The environment 200 may include, for example, one or more client-side devices 202, also described herein as electronic devices 202, a document analysis system 204 associated with a document analysis platform, and/or a document database system 206 associated with one or more document databases 208. Some or all of the devices and systems may be configured to communicate with each other via a network 210.

The electronic devices 202 may include components such as, for example, one or more processors 212, one or more network interfaces 214, and/or memory 216. The memory 216 may include components such as, for example, one or more user interfaces 218 and/or one or more document databases 220. As shown in FIG. 2, the electronic devices 202 may include, for example, a computing device, a mobile phone, a tablet, a laptop, and/or one or more servers. The components of the electronic device 202 will be described below by way of example. It should be understood that the example provided herein is illustrative, and should not be considered the exclusive example of the components of the electronic device 202.

By way of example, the user interface(s) 218 include one or more of the user interfaces described elsewhere herein, such as the user interface 100 corresponding to a model builder user interface, a document summary user interface, a full document user interface, user interfaces utilized for document voting, confidence value user interfaces, keyword user interfaces, search query user interfaces, model taxonomy user interfaces, etc. It should be understood that while the user interfaces 218 are depicted as being a component of the memory 216 of the client-side devices 202, the user interfaces 218 may additionally or alternatively be associated with the document analysis system 204. The user interfaces 218 may be configured to display information associated with the document analysis platform and to receive user input associated with the document analysis platform. The document databases 220 of the client-side device 202, and/or the document databases 208 of the document database system 206 may include data corresponding to documents that a user may desire to be analyzed using the document analysis platform. Those documents may include, for example, patents and patent applications, and/or the documents may include non-patent documents. The documents may be stored with respect to the document databases 208 of the document database system 206 and/or the documents may be stored with respect to the document databases 220 of the client-side devices 202.

The document analysis system 204 may include one or more components such as, for example, one or more processors 222, one or more network interfaces 224, and/or memory 226. The memory 226 may include one or more components such as, for example, a model builder component 228 and/or a model taxonomy component 230. The model builder component 228 may be configured to receive user input data as described herein for labelling documents as in class or out of class. The model builder component 228 may also be configured to utilize the user input data, as well as other data associated with a document set in question, to train classification models for determining the classification of a given document. The model builder component 228 may also be configured to utilize the trained classification models to predict document classification and to display results of the use of the classification models. The model taxonomy component 230 may be configured to generate and utilize a model taxonomy including the trained classification models. The model taxonomy component 230 may also be configured to receive user input data representing user queries for use of classification models and to display search results to the search query indicating one or more models associated with the search query.

As shown in FIG. 2, several of the components of the document analysis system 204 and/or the client-side devices 202 and the associated functionality of those components as described herein may be performed by one or more of the other systems and/or by the client-side devices 202. Additionally, or alternatively, some or all of the components and/or functionalities associated with the client-side devices 202 may be performed by the document analysis system 204.

It should be noted that the exchange of data and/or information as described herein may be performed only in situations where a user has provided consent for the exchange of such information. For example, a user may be provided with the opportunity to opt in and/or opt out of data exchanges between devices and/or with the remote systems and/or for performance of the functionalities described herein. Additionally, when one of the devices is associated with a first user account and another of the devices is associated with a second user account, user consent may be obtained before performing some, any, or all of the operations and/or processes described herein.

As used herein, a processor, such as processor(s) 212 and/or 222, may include multiple processors and/or a processor having multiple cores. Further, the processors may comprise one or more cores of different types. For example, the processors may include application processor units, graphic processing units, and so forth. In one implementation, the processor may comprise a microcontroller and/or a microprocessor. The processor(s) 212 and/or 222 may include a graphics processing unit (GPU), a microprocessor, a digital signal processor or other processing units or components known in the art. Alternatively, or in addition, the functionality described herein can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), system-on-a-chip systems (SOCs), complex programmable logic devices (CPLDs), etc. Additionally, each of the processor(s) 212 and/or 222 may possess its own local memory, which also may store program components, program data, and/or one or more operating systems.

The memory 216 and/or 226 may include volatile and nonvolatile memory, removable and non-removable media implemented in any method or technology for storage of information, such as computer-readable instructions, data structures, program components, or other data. Such memory 216 and/or 226 includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, RAID storage systems, or any other medium which can be used to store the desired information and which can be accessed by a computing device. The memory 216 and/or 226 may be implemented as computer-readable storage media (“CRSM”), which may be any available physical media accessible by the processor(s) 212 and/or 222 to execute instructions stored on the memory 216 and/or 226. In one basic implementation, CRSM may include random access memory (“RAM”) and Flash memory. In other implementations, CRSM may include, but is not limited to, read-only memory (“ROM”), electrically erasable programmable read-only memory (“EEPROM”), or any other tangible medium which can be used to store the desired information and which can be accessed by the processor(s).

Further, functional components may be stored in the respective memories, or the same functionality may alternatively be implemented in hardware, firmware, application specific integrated circuits, field programmable gate arrays, or as a system on a chip (SoC). In addition, while not illustrated, each respective memory, such as memory 216 and/or 226, discussed herein may include at least one operating system (OS) component that is configured to manage hardware resource devices such as the network interface(s), the I/O devices of the respective apparatuses, and so forth, and provide various services to applications or components executing on the processors. Such OS component may implement a variant of the FreeBSD operating system as promulgated by the FreeBSD Project; other UNIX or UNIX-like variants; a variation of the Linux operating system as promulgated by Linus Torvalds; the FireOS operating system from Amazon.com Inc. of Seattle, Wash., USA; the Windows operating system from Microsoft Corporation of Redmond, Wash., USA; LynxOS as promulgated by Lynx Software Technologies, Inc. of San Jose, Calif.; Operating System Embedded (Enea OSE) as promulgated by ENEA AB of Sweden; and so forth.

The network interface(s) 214 and/or 224 may enable messages between the components and/or devices shown in environment 200 and/or with one or more other remote systems, as well as other networked devices. Such network interface(s) 214 and/or 224 may include one or more network interface controllers (NICs) or other types of transceiver devices to send and receive messages over the network 210.

For instance, each of the network interface(s) 214 and/or 224 may include a personal area network (PAN) component to enable messages over one or more short-range wireless message channels. For instance, the PAN component may enable messages compliant with at least one of the following standards: IEEE 802.15.4 (ZigBee), IEEE 802.15.1 (Bluetooth), IEEE 802.11 (WiFi), or any other PAN message protocol. Furthermore, each of the network interface(s) 214 and/or 224 may include a wide area network (WAN) component to enable messages over a wide area network.

In some instances, the document analysis system 204 may be local to an environment associated with the electronic device 202. For instance, the document analysis system 204 may be located within the electronic device 202. In some instances, some or all of the functionality of the document analysis system 204 may be performed by the electronic device 202. Also, while various components of the document analysis system 204 have been labeled and named in this disclosure and each component has been described as being configured to cause the processor(s) to perform certain operations, it should be understood that the described operations may be performed by some or all of the components and/or other components not specifically illustrated.

FIG. 3A illustrates an example user interface 300 displaying document data and user input data for classification model building. The user interface 300 depicted in FIG. 3A may also be described as a summary user interface 300 herein.

For example, the summary user interface 300 may include portions of the documents and information associated with the documents, such as whether the documents have been labeled, a prediction made in association with the document, a confidence value associated with the prediction, and/or an evaluation of the document and/or a portion of the document. In the example where the documents are patents and patent applications, the user interface 300 may display portions of the documents such as the publication number, the title, the abstract, one or more claims, and/or a claim score. The claim score may be based at least in part on an analysis of the claims of a given patent and the claim score may be provided by way of a scale, such as a scale from 0 to 5, where 0 represents the broadest claim score and 5 represents the narrowest claim score. The user interface 300 may also provide some information about the project category associated with the documents, such as the category title, a category progress indicating how many documents have been labeled and/or predicted to be in class and out of class, how many documents have been skipped, and/or a total number of uploaded documents. For example, as depicted in FIG. 3A, the category is “Category A” and the category progress is “929 In,” “5697 Out,” “0 Skipped,” and “550 Unlabeled.” The user interface 300 may also provide options for viewing the document summaries, such as a filter option 302 to filter the summaries based at least in part on one or more of the attributes of the documents and/or the analysis of the documents. The options may also include a sort option 304, which may be utilized to sort the summaries based at least in part on one or more of the attributes of the documents and/or the analysis of the documents. The options may also include a columns option 306, which may be utilized to remove or add columns of information to the summaries. The options may also include an action option 308, which may be utilized to take an action in association with a document, such as tagging a document, removing a document, editing a document, etc. In addition, the user interface 300 may include selectable portions 310 associated with some or each of the document summaries that, when selected, may cause another user interface to display the full document associated with the selected portion.
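The filter and sort options might operate on per-document summary records as in the sketch below. The `Summary` record, its field names, the document identifiers, and the example values are hypothetical and are not drawn from FIG. 3A.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    """Hypothetical summary row for one document."""
    doc_id: str
    prediction: str      # "in" or "out"
    confidence: float    # 0.0 to 1.0
    claim_score: int     # 0 (broadest) to 5 (narrowest)


def filter_summaries(rows, prediction=None, min_confidence=0.0):
    """Keep rows matching the prediction (if given) and confidence floor."""
    return [
        r for r in rows
        if (prediction is None or r.prediction == prediction)
        and r.confidence >= min_confidence
    ]


def sort_summaries(rows, key="confidence", descending=True):
    """Sort rows by any attribute of the summary record."""
    return sorted(rows, key=lambda r: getattr(r, key), reverse=descending)


rows = [
    Summary("DOC-1", "in", 0.988, 0),
    Summary("DOC-2", "out", 0.61, 3),
    Summary("DOC-3", "in", 0.72, 5),
]
in_class = filter_summaries(rows, prediction="in")
by_claim = sort_summaries(rows, key="claim_score", descending=False)
print([r.doc_id for r in in_class], [r.doc_id for r in by_claim])
```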

FIG. 3B illustrates an example user interface 350 with functionality for displaying document data and for accepting user input. The user interface 350 depicted in FIG. 3B may also be described as a full document user interface 350 herein.

The full document user interface 350 may include some of the same information as described in the summary user interface 300. That information may include the document title, publication number, abstract, claims, and category notes such as the number of documents marked in class and out of class, the number of documents skipped, the number of documents that have been labeled, and analysis details of the document. The full document user interface 350 may provide additional information regarding some or all of the aspects of a given document. For example, additional portions of the abstract and/or additional claims and/or claim language may be displayed in the full document user interface 350. Additionally, the category progress information and analysis details may be displayed in a category notes window 352. The analysis details may include the prediction made with respect to the document, such as whether a classification model determined that the document was in class or out of class, a confidence value associated with that determination, and a claim score associated with the claims of the document. For example, as shown in FIG. 3B, the prediction made with respect to this document is “in,” and the confidence value associated with that determination is 0.988. As will be described in more detail elsewhere herein, the confidence value may be on a scale of 0.0 to 1.0, with 1.0 indicating complete confidence that a document is marked correctly. As such, a confidence value of 0.988 represents a near certain determination that the document is marked correctly. The claim score in FIG. 3B is “0,” indicating the most favorable claim score on the 0 to 5 scale. The claim score may be determined in a number of ways, such as based at least in part on the number of words in a given claim, the terminology used in a claim, the support for the claim language in the detailed description of the patent, the presence of negative limitations and/or modifiers, the number of recited operations, etc.

In addition to the above, the full document user interface 350 may provide a voting window 354 that may allow a user to provide user input indicating whether the document should be labeled as relevant or otherwise “in class” or irrelevant or otherwise “out of class.” Additional options may include “skip” and “undo,” for example. The voting window 354 may also be utilized to present one or more of the keywords, an option to enable “hotkeys” or other shortcut keys that allow for the user input via a keyboard or similar device as opposed to scrolling with a mouse and clicking one of the options, and an option to utilize uncertainty sampling. For example, a user may view the information about the document in the full document user interface 350. After review of some or all of the information being displayed, the user may determine that the document is either in class or out of class (or determine that the document is to be skipped). In examples where the document is to be labeled as in class, the user may utilize one or more input means to select a portion of the screen corresponding to the “in” option. In examples where the document is to be labeled as out of class, the user may utilize one or more input means to select a portion of the screen corresponding to the “out” option. Alternatively, when hotkeys are enabled, the user may select the corresponding hotkey on a keyboard (whether physical or digital). Upon selection of one of the options in the voting window 354, the user interface 350 may be caused to display the next unlabeled document in the document set to allow the user to review that document and provide user input associated with the classification of that document.

FIG. 4 illustrates an example user interface 100 displaying information associated with use of a classification model for determining which portions of a document set are in class.

For example, the document analysis platform may be configured to receive data representing documents. The documents may be provided to the document analysis platform by users uploading those documents and/or by the document analysis platform fetching or otherwise retrieving those documents. While example documents as described herein may be patents and patent applications, it should be understood that the documents may be any type of documents that a classification analysis may be performed on. Additionally, while examples provided herein discuss the analysis of text data from the documents, other forms of content data may also be analyzed, such as image data, metadata, audio data, etc. The data representing the documents may be received at the document analysis platform and may be stored in a database associated with the document analysis platform.

The document analysis platform may be configured to display the user interface 100 for presenting information associated with the documents and/or analysis of the documents. For example, the user interface 100 may include selectable portions that, when selected, may present information associated with a model building component of the document analysis platform and/or information associated with a model taxonomy component of the document analysis platform. When the model building component is selected, the user interface 100 may be caused to display categories 104 associated with the documents and/or a classification analysis that is being or has been conducted. Example categories 104 for a given analysis may include, for example, project categories such as “asphalt roofing production,” “natural materials,” “roofing service,” “tire recycle,” etc. As shown in FIG. 4, the example categories 104 are “Category 1,” “Category 2,” and “Category 3.” Some or all of these categories 104, as displayed on the user interface 100, may be selectable to allow a user to navigate between project categories to see information associated with those project categories. Additionally, a category window 106 may be displayed on the user interface 100 that may present a title of the category 104, a status of the classification model being utilized for determining classification of documents associated with the category 104, an option to export the analysis and/or documents associated with the analysis from the user interface 100, a classification window 108, an estimated model health window 110, a keyword window 112, and/or a model application window 114. With respect to the category title, one or more tags that have been determined from user input and/or from the classification model may be displayed. The tags may provide additional information about the project category 104 and/or restrictions on the classification determinations associated with the project category 104. With respect to the status of the classification model, this portion of the category window 106 may provide a user with a visualization of which stage in the model building process this project category 104 is associated with. For example, at the outset of a project category 104, a classification model may not be selected or trained. In these examples, the status may indicate that no model has been selected. Once a model is selected, the user may start providing indications of which documents are in class and which documents are out of class. These indications may be utilized to train the selected model. However, depending on the amount and quality of the user indications, the output of the trained model may not be associated with a high confidence value. In these examples, the status may indicate that the model has been trained but is not yet stable. Once the confidence values increase as the model is retrained, the status may indicate that the model is stable.

The classification window 108 may be configured to display information associated with the number of uploaded or otherwise accessed documents associated with the project category 104 as well as the number of those documents that have been labeled as in class or out of class by a user. The classification window 108 may also be configured with a selectable portion for allowing users to upload documents. The classification window 108 may also include an option to view a list of the documents that have been uploaded as well as an option to start classifying the documents that have been uploaded. Additional details on the list view and the user interface for classifying documents will be described in detail elsewhere herein.

The estimated model health window 110 may be configured to display an indication of the number of the documents that have been labeled as in class and the number of the documents that have been labeled as out of class. As described more fully herein, the user may utilize the document analysis platform to display a given document and/or portions of a given document. The user interface displaying the document may also include classifying options, which may be selectable to indicate whether the document being displayed should be labeled as “in,” corresponding to a relevant document, or “out,” corresponding to an out of class document. Other options may include, for example, an “undo” option that may be utilized to undo labelling of a document, and a “skip” option which may be utilized when the user does not desire to provide an in or out indication for a given document. As documents are labeled, the number of labeled documents increases and that information is displayed in the estimated model health window 110. The estimated model health window 110 may also be configured to display an indication of the number of the documents that were predicted to be in class by the classification model and the number of the documents that were predicted to be out of class by the classification model. As described more fully herein, the classification model may be trained utilizing the labeled documents. For example, a positive training dataset associated with the documents labeled “in” may be generated and, in examples, a negative training dataset associated with the documents labeled “out” may be generated. These datasets may be utilized to train the classification model to identify, for the documents that have not been labeled, whether a given document is in class or out of class. This information may be displayed in the estimated model health window 110.

The estimated model health window 110 may also be configured to display a score trend indicator, which may indicate a confidence value associated with utilizing an instance of the classification model to predict classification of the unlabeled documents. For example, a first set of user input indicating classification of a first set of the documents may be received and utilized to train the classification model. The classification model may be run and a confidence value associated with predictions made by the classification model may be determined. Thereafter, a second set of user input indicating classification of additional ones of the documents may be received and the classification model may be retrained utilizing the second set of user input. This retrained instance of the classification model may be run and another confidence value associated with predictions made by the retrained classification model may be determined. The score trend indicator may display the confidence values as they change from run to run and may provide an indication about whether the confidence values are increasing, remaining constant, or decreasing. Such an indication may provide a user with a gauge of the impact of a given set of user inputs for training the model and whether those inputs are improving or hindering the model's ability to predict classification for the documents at issue.
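As a minimal sketch of how such a score trend indication might be derived, the following Python compares the two most recent confidence values. The function name `score_trend` and the `tolerance` band used to decide what counts as "remaining constant" are illustrative assumptions and are not taken from the description above.

```python
def score_trend(confidences, tolerance=0.005):
    """Describe how the confidence value moved between the last two runs.

    `confidences` holds one confidence value per model run, oldest first.
    `tolerance` is a hypothetical dead band: changes smaller than this are
    reported as "constant".
    """
    if len(confidences) < 2:
        return "insufficient data"
    delta = confidences[-1] - confidences[-2]
    if delta > tolerance:
        return "increasing"
    if delta < -tolerance:
        return "decreasing"
    return "constant"


if __name__ == "__main__":
    # Confidence values recorded after three successive trainings.
    print(score_trend([0.71, 0.78, 0.77]))
```

A wider tolerance makes the indicator less sensitive to small run-to-run noise, at the cost of reacting later to a real decline.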

The estimated model health window 110 may also be configured to display a stopping criteria indicator, which may indicate a marginal benefit of receiving additional user input for model training. For example, when a model is initially trained and run, a first number of the documents will be predicted as “in” and a second number of the documents will be predicted as “out.” When the model is retrained using additional labeling information and run, the number of documents predicted as “in” and the number of documents predicted as “out” may change. This process may continue as additional labeling information is obtained and the model is retrained. However, it may be beneficial to display for a user the stopping criteria indicator, which may indicate how the number of “in” and “out” predictions differs from run to run of the model. For example, the stopping criteria indicator may show that a last retraining and running of the model did not change the number of documents labeled “in” and “out,” and/or that the change was slight. In these examples, the user may utilize the stopping criteria indicator to determine that additional labeling to improve the model's ability to predict classification is not warranted.
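The run-to-run comparison described above can be illustrated with a short sketch. The dictionary layout, the function name `stopping_indicator`, and the `max_change` cutoff of 1% are hypothetical choices made only for this example.

```python
def stopping_indicator(prev_counts, curr_counts, max_change=0.01):
    """Compare "in"/"out" prediction counts from two consecutive runs.

    Returns (fraction_changed, may_stop). `fraction_changed` is the larger
    of the two count differences divided by the current number of predicted
    documents. `max_change` is an assumed cutoff below which further
    labeling is treated as offering little marginal benefit.
    """
    total = curr_counts["in"] + curr_counts["out"]
    if total == 0:
        raise ValueError("no predictions in the current run")
    delta_in = abs(curr_counts["in"] - prev_counts["in"])
    delta_out = abs(curr_counts["out"] - prev_counts["out"])
    fraction_changed = max(delta_in, delta_out) / total
    return fraction_changed, fraction_changed <= max_change


if __name__ == "__main__":
    previous = {"in": 120, "out": 880}
    current = {"in": 122, "out": 878}
    print(stopping_indicator(previous, current))
```

Because only counts are compared, this sketch cannot detect the case where equal numbers of documents swap between “in” and “out”; a per-document comparison would be needed for that.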

The estimated model health window 110 may also be configured to display an option to revert the model to a previous version and an option to train the model based on new labeling information. For example, when the score trend indicates a decrease in confidence from retraining a model based on a given set of user input, the option to revert may be selected and the previous version of the model may be identified as the current model. The option to revert may also, in examples, remove the labeling of documents associated with that model. The training option may be utilized to instruct the document analysis platform to retrain the model based at least in part on user input received since the model was last trained. Upon retraining the model, the user interface 100 may be configured to enable the model application window 114 to provide functionality for the user to select an option to run the model as trained or otherwise predict classification of the documents in the document set. When the predict option is selected, the documents that have not been labeled may be analyzed to determine whether to mark those documents as in class or out of class. When this occurs, the estimated model health window 110 may be updated, such as by changing the number of documents predicted to be “in” and “out,” the score trend indicator, and/or the stopping criteria indicator.

In addition to the above, the model keywords window 112 may provide a visual indication of the keywords that the model has determined to be included as in class and those keywords that the model has determined to be excluded as out of class. The presentation of these keywords may take one or more forms, such as a word cloud and/or a table. In a word cloud, the size, font, emphasis, and spacing of the keywords from each other may indicate the relative importance of a given keyword to the included and excluded groupings. For example, a keyword located in the center of the word cloud with larger, darker, more emphasized font than other keywords may be the most relevant keyword to the grouping. In the table view, keywords may be ranked and the more relevant keyword may be identified as the first keyword in the table.

FIG. 5 illustrates an example user interface 500 displaying information associated with confidence values associated with a document analysis architecture. The user interface 500 may also be described herein as the confidence value user interface 500.

The confidence value user interface 500 may display information associated with one or more confidence values associated with a given classification model. For example, the confidence value user interface 500 may include a score trend 502, a stability trend 504, an overall category confidence trend 506, and an overall random confidence trend 508. The score trend 502 may indicate, for a particular classification model, confidence values for use of the model to determine classification of documents. Each data point may represent a run of the classification model and a determination of a confidence value associated with each run. In some examples, prior to a particular run, additional user input may be received indicating additional documents have been labeled by a user as in class or out of class. This user input may be utilized to retrain the model. In some instances, the additional user input may increase the model's ability to accurately determine classification. This may lead to an increase in the confidence value associated with predictions made by the model. In other instances, the additional user input may decrease the model's ability to accurately determine classification. This may lead to a decrease in the confidence value associated with predictions made by the model. Additionally, or alternatively, additional documents may be added to the document set prior to a given run of the model, and the confidence value may differ based at least in part on the determinations of classification made in association with those newly-added documents. By so doing, the score trend 502 may provide an indication of model health over time and a trend as to whether that model health is improving or declining. The score trend 502 may also be described as a measure of how accurate the model is based at least in part on a calculation of an F1 score. The F1 score may be a harmonic mean of the model's precision and recall. In these examples, a value of 1 may be considered best, while a 0 may be considered worst.
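For illustration, an F1 score of the kind referenced above may be computed as the harmonic mean of precision and recall. The sketch below assumes that user-supplied labels serve as the reference against which model predictions are compared; that assumption, and the function name `f1_score`, are not stated in the description.

```python
def f1_score(true_labels, predicted_labels, positive="in"):
    """Harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Assumed reference: labels a user gave to five documents.
    user_labels = ["in", "in", "out", "out", "in"]
    model_output = ["in", "out", "out", "in", "in"]
    print(round(f1_score(user_labels, model_output), 3))
```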

The stability trend 504 may indicate agreement between each version and/or training of a model. In examples, a stability trend score of 1 may be considered best, while a 0 may be considered worst. A value below 0.999 may mean additional labeled data may be beneficial to make the model successful generally at predicting classification of documents. The calculation for the stability trend 504 may be known as Cohen's kappa statistic. If the metric stabilizes over several trainings but is still short of a recommended value, such as 0.99, an analyst may consider adding additional training data to the model and labeling that data. If the stability trend 504 is fluctuating, this may mean that a user may consider starting training over with a more targeted focus for the model.
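Cohen's kappa compares observed agreement with the agreement expected by chance. The sketch below applies it to the predictions of two successive model versions over the same documents; treating consecutive versions as the two "raters" is an assumption made for this example, as is the function name.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two equal-length sequences of labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(labels_a)
    # Observed agreement: fraction of documents given the same label.
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    # Chance agreement from each version's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        counts_a[label] * counts_b[label]
        for label in set(counts_a) | set(counts_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both versions assign one identical label to everything
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    previous_version = ["in", "in", "out", "out"]
    current_version = ["in", "in", "out", "in"]
    print(cohens_kappa(previous_version, current_version))
```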

The overall category confidence trend 506 may indicate the average confidence within a category data set defined by a user. For this overall category confidence trend 506, confidence for any given document may be considered the difference between the predicted probability of the document being marked in class and the predicted probability of the document being marked out of class. An overall category confidence trend 506 value of 1 may be considered best, while a value of 0 may be considered worst.

The overall random confidence trend 508 may indicate confidence within a category data set, similar to the overall category confidence trend 506, but instead using a random set of documents from a full corpus of the documents to indicate confidence. An overall random confidence trend 508 value of 1 may be considered best, while a value of 0 may be considered worst.
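The two confidence trends above can be sketched as follows. The example assumes a binary classifier in which the out-of-class probability is one minus the in-class probability, and takes the absolute value of the difference so that the result falls between 0 and 1; the function names, the sample size, and the fixed seed are hypothetical.

```python
import random


def document_confidence(p_in):
    """Absolute gap between the in-class and out-of-class probabilities."""
    p_out = 1.0 - p_in  # assumption: two classes whose probabilities sum to 1
    return abs(p_in - p_out)


def overall_confidence(p_in_values):
    """Average per-document confidence over a set of documents."""
    if not p_in_values:
        raise ValueError("at least one document is required")
    return sum(document_confidence(p) for p in p_in_values) / len(p_in_values)


def random_confidence(corpus_p_in, sample_size, seed=0):
    """Same average, taken over a random sample drawn from the full corpus."""
    rng = random.Random(seed)  # fixed seed only to keep the example repeatable
    k = min(sample_size, len(corpus_p_in))
    return overall_confidence(rng.sample(list(corpus_p_in), k))


if __name__ == "__main__":
    category_probabilities = [0.98, 0.91, 0.15, 0.55]
    print(overall_confidence(category_probabilities))
    print(random_confidence(category_probabilities, sample_size=2))
```

A document predicted at 0.5 contributes a confidence of 0, while one predicted at 0.0 or 1.0 contributes 1, which matches the 0 (worst) to 1 (best) range described for both trends.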

FIG. 6 illustrates an example user interface 600 displaying document data and indications of relationships between documents. The user interface 600 may be the same as or similar to the summary user interface 300 as described with respect to FIG. 3A.

The user interface 600 may provide one or more indications of relationships 602 between documents in the document set. For example, when the documents are patents and patent applications, the user interface 600 may display indications of relationships 602 between various patents and patent applications. For example, patents and patent applications may be related as “families,” meaning that the patents and patent applications have some relationship, generally associated with priority dates. For example, a given application may be a continuation, divisional, and/or continuation-in-part application of another application. In other examples, a given application may be a foreign counterpart or a Patent Cooperation Treaty (PCT) application of another application. The user interface 600 may provide an indication of such relationships, such as by grouping the documents in a family together and/or providing a visual indicator of the relationship, such as a box surrounding the summaries of the documents in a given family. In these examples, each of the documents in a given family may be predicted to be in class or out of class based at least in part on one of those documents being predicted to be in class or out of class, respectively. Additionally, when one document in a family is labeled by a user as in class or out of class, the document analysis platform may automatically label the other documents in the family accordingly.
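The family-based labeling described above might look like the following sketch. The mapping names and the document identifiers are invented for the example, and the handling of a family whose members received conflicting user labels (no propagation) is an assumption, since the description does not address that case.

```python
def propagate_family_labels(family_of, user_labels):
    """Extend user labels to every document in the same family.

    family_of:   hypothetical mapping of document id -> family id
    user_labels: mapping of document id -> "in" or "out"
    Returns a mapping of document id -> label that covers the explicitly
    labeled documents plus their unlabeled family members.
    """
    labels_by_family = {}
    for doc_id, label in user_labels.items():
        labels_by_family.setdefault(family_of[doc_id], set()).add(label)

    result = dict(user_labels)
    for doc_id, family_id in family_of.items():
        family_labels = labels_by_family.get(family_id, set())
        # Assumption: propagate only when the family has one consistent label.
        if doc_id not in result and len(family_labels) == 1:
            result[doc_id] = next(iter(family_labels))
    return result


if __name__ == "__main__":
    families = {"US-1": "F1", "US-2": "F1", "EP-3": "F1", "US-9": "F2"}
    print(propagate_family_labels(families, {"US-1": "in"}))
```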

FIG. 7A illustrates an example user interface 700 displaying keywords determined to be in class and keywords determined to be out of class by a classification model in a word-cloud format. The user interface 700 may be a portion of the model building user interface 100 and may correspond to at least a portion of the model keywords window 112 as described with respect to FIG. 1.

The user interface 700 may provide a visual indication of the keywords that a given classification model has determined to be included as in class and those keywords that the model has determined to be excluded as out of class. For example, as described more fully elsewhere herein, the models may utilize training datasets indicating which documents are labeled in class and which documents are labeled out of class. Features of those documents may be identified that represent the documents, and those features may be utilized to train the models. In examples, the features may include keywords that represent the text of the document. The presentation of these keywords in the user interface 700 may take one or more forms, such as a word cloud as illustrated in FIG. 7A. In a word cloud, the size, font, emphasis, and spacing of the keywords from each other may indicate the relative importance of a given keyword to the included and excluded groupings. For example, the user interface 700 may include an included keyword window 702 and an excluded keyword window 704. The included keyword window 702 may provide a visual indication of the keywords that the model has determined are representative of the documents labeled as in class. The excluded keyword window 704 may provide a visual indication of the keywords that the model has determined are representative of the documents labeled as out of class. The keywords may each be associated with a different weighting value or otherwise may be more or less important for determining document classification. A visual indication of these weighting values may be provided in the included keyword window 702 and the excluded keyword window 704. For example, a keyword located in the center of the word cloud with larger, darker, more emphasized font than other keywords may be the most relevant keyword to the grouping. As shown in FIG. 7A, the example word clouds illustrate that the keyword “Word A” is most important for determining relevant documents while keyword “Word 1” is most important for determining out of class documents.

In examples, the user interface 700 may be configured to receive user input associated with the keywords. For example, the user input may include a user confirming that a keyword should be included in one or more of the included keyword window 702 and the excluded keyword window 704. The user input may also include a user indicating that a given keyword should be removed, deemphasized, or emphasized more than it currently is. User input data corresponding to the user input may be utilized to retrain the classification model. Additionally, a user may provide user input indicating that a word that is not included in a given window should be included, and the classification model may be retrained based at least in part on that user input data.

FIG. 7B illustrates an example user interface 750 displaying keywords determined to be in class and keywords determined to be out of class by a classification model in a list format. The user interface 750 may be a portion of the model building user interface 100 and may correspond to at least a portion of the model keywords window 112 as described with respect to FIG. 1.

The user interface 750 may provide a visual indication of the keywords that a given classification model has determined to be included as in class and those keywords that the model has determined to be excluded as out of class. For example, as described more fully elsewhere herein, the models may utilize training datasets indicating which documents are labeled in class and which documents are labeled out of class. Features of those documents may be identified that represent the documents, and those features may be utilized to train the models. In examples, the features may include keywords that represent the text of the document. The presentation of these keywords in the user interface 750 may take one or more forms, such as a list format as illustrated in FIG. 7B. In the list format, keywords may be ranked and the more relevant keyword may be identified as the first keyword in the table. For example, the user interface 750 may include an included keyword window 702 and an excluded keyword window 704. The included keyword window 702 may provide a visual indication of the keywords that the model has determined are representative of the documents labeled as in class. The excluded keyword window 704 may provide a visual indication of the keywords that the model has determined are representative of the documents labeled as out of class. The keywords may each be associated with a different weighting value or otherwise may be more or less important for determining document classification. A visual indication of these weighting values may be provided in the included keyword window 702 and the excluded keyword window 704. For example, a keyword that is located at the top of a given window or is otherwise associated with a highest-ranking indicator may be the most relevant keyword to the grouping. As shown in FIG. 7B, the example lists illustrate that the keyword “Word A” is most important for determining relevant documents while keyword “Word 1” is most important for determining out of class documents. In the list view, an indication of the importance and/or a confidence value associated with the keyword being included in a given window may be displayed. This may provide the user with an indication of not just the ranking of the keywords, but how important those keywords have been determined to be by the classification model.
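One way a ranked list of this kind could be produced is sketched below. It assumes, purely for illustration, that the classification model exposes a signed weight per keyword, with positive weights pointing toward in class and negative weights toward out of class; the description does not specify how the weighting values are represented.

```python
def rank_keywords(weights, top_n=10):
    """Split signed keyword weights into ranked included/excluded lists.

    weights: hypothetical mapping of keyword -> signed weight
    Returns (included, excluded), each a list of (keyword, importance)
    pairs sorted from most to least important.
    """
    included = sorted(
        ((word, w) for word, w in weights.items() if w > 0),
        key=lambda pair: pair[1],
        reverse=True,
    )
    excluded = sorted(
        ((word, -w) for word, w in weights.items() if w < 0),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return included[:top_n], excluded[:top_n]


if __name__ == "__main__":
    example = {"Word A": 2.1, "Word B": 0.7, "Word 1": -1.9, "Word 2": -0.4}
    included, excluded = rank_keywords(example)
    print(included)
    print(excluded)
```

The importance value stored alongside each keyword could then be shown next to the keyword in the list view, or mapped to font size in a word cloud.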

In examples, the user interface 750 may be configured to receive user input associated with the keywords. For example, the user input may include a user confirming that a keyword should be included in one or more of the included keyword window 702 and the excluded keyword window 704. The user input may also include a user indicating that a given keyword should be removed, deemphasized, or emphasized more than it currently is. User input data corresponding to the user input may be utilized to retrain the classification model. Additionally, a user may provide user input indicating that a word that is not included in a given window should be included, and the classification model may be retrained based at least in part on that user input data.

FIGS. 8-10 illustrate processes associated with document analysis platforms. The processes described herein are illustrated as collections of blocks in logical flow diagrams, which represent a sequence of operations, some or all of which may be implemented in hardware, software or a combination thereof. In the context of software, the blocks may represent computer-executable instructions stored on one or more computer-readable media that, when executed by one or more processors, program the processors to perform the recited operations. Generally, computer-executable instructions include routines, programs, objects, components, data structures and the like that perform particular functions or implement particular data types. The order in which the blocks are described should not be construed as a limitation, unless specifically noted. Any number of the described blocks may be combined in any order and/or in parallel to implement the process, or alternative processes, and not all of the blocks need be executed. For discussion purposes, the processes are described with reference to the environments, architectures and systems described in the examples herein, such as, for example, those described with respect to FIGS. 1-7B and 11-30, although the processes may be implemented in a wide variety of other environments, architectures and systems.

FIG. 8 illustrates a flow diagram of an example process 800 for determining a model accuracy. The order in which the operations or steps are described is not intended to be construed as a limitation, and any number of the described operations may be combined in any order and/or in parallel to implement process 800. The operations described with respect to the process 800 are described as being performed by a client device, and/or a system associated with the document analysis platform. However, it should be understood that some or all of these operations may be performed by some or all of the components, devices, and/or systems described herein.

At block 802, the process 800 may include receiving user input data indicating in class documents and out of class documents from a subset of first documents. For example, the system may receive user input data indicating in class documents and out of class documents from a subset of first documents. To illustrate, if the first documents include 1,000 documents, the user input data may indicate classification for a subset, such as 20, of those documents. Users may utilize a user interface to provide user input, such as the user interface 350 from FIG. 3B.

At block 804, the process 800 may include training a classification model based at least in part on the user input data. For example, the system may utilize that user input data to train a classification model such that the classification model is configured to determine whether a given document is more similar to those documents marked in class or more similar to those documents marked out of class. To train the classification models utilizing this user input data, the document analysis platform may perform one or more operations. In some examples, the platform may generate a positive training dataset indicating in class keywords associated with the documents marked in class by a user. For example, the platform may determine one or more keywords associated with a given document that represent the subject matter of that document. This may be performed utilizing one or more document processing techniques, such as term frequency inverse document frequency techniques, for example. The platform may also generate a negative training dataset indicating keywords from the documents marked out of class by the user input. Each of these training datasets may then be utilized to train the classification model such that the classification model is configured to determine whether a given document has keywords that are more similar to the in class keywords than to the out of class keywords. In other examples, instead of or in addition to generating training datasets based on keywords, the platform may determine a vector for a given document. The vector may be associated with a coordinate system and may represent the subject matter of the document in the form of a vector. Vectors may be generated for the documents labeled in class and for the documents labeled out of class. The classification model may be trained to determine whether a vector representation of a given document is closer to the in class vectors than to the out of class vectors in the coordinate system. Techniques to generate vectors representing documents may include vectorization techniques such as Doc2Vec, or other similar techniques.
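The training approach described above can be illustrated with a small, self-contained sketch. It is not the platform's implementation: it substitutes a hand-rolled term frequency inverse document frequency weighting and a nearest-centroid comparison by cosine similarity for whatever model and vectorization (for example, Doc2Vec) the platform may use, and every name in it is hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def centroid(vectors):
    """Average a list of sparse {term: weight} vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: weight / len(vectors) for term, weight in total.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class CentroidClassifier:
    """Toy in/out classifier built from labeled example texts."""

    def fit(self, in_docs, out_docs):
        tokenized = [tokenize(d) for d in in_docs + out_docs]
        n = len(tokenized)
        doc_freq = Counter(t for toks in tokenized for t in set(toks))
        # Smoothed inverse document frequency for each known term.
        self.idf = {t: math.log((1 + n) / (1 + c)) + 1.0
                    for t, c in doc_freq.items()}
        vectors = [self._vector(toks) for toks in tokenized]
        self.in_centroid = centroid(vectors[:len(in_docs)])    # positive set
        self.out_centroid = centroid(vectors[len(in_docs):])   # negative set
        return self

    def _vector(self, tokens):
        counts = Counter(t for t in tokens if t in self.idf)
        return {t: c * self.idf[t] for t, c in counts.items()}

    def predict(self, text):
        vec = self._vector(tokenize(text))
        sim_in = cosine(vec, self.in_centroid)
        sim_out = cosine(vec, self.out_centroid)
        return ("in" if sim_in >= sim_out else "out"), sim_in, sim_out


if __name__ == "__main__":
    model = CentroidClassifier().fit(
        in_docs=["asphalt roofing shingle production",
                 "roofing shingle granules asphalt coating"],
        out_docs=["tire recycling rubber crumb",
                  "rubber tire shredding process"],
    )
    print(model.predict("asphalt shingle coating line"))
```

Terms that never appeared in the labeled examples are ignored at prediction time, so a document with no known terms scores 0.0 against both centroids and, in this sketch, defaults to “in.” A production system would need an explicit policy for that case.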

In addition to the techniques for training the classification models described above, the classification models may also be trained and/or organized based at least in part on classifications of the documents. For example, when the documents are patents and patent applications, a predetermined classification system may be established for classifying the subject matter of a given document. The classification system may be determined by the platform, by one or more users, and/or by a third party. For example, patents and patent applications may be associated with a predefined classification system such as the Cooperative Patent Classification (CPC) system. The CPC system employs CPC codes that correspond to differing subject matter, as described in more detail herein. The CPC codes for a given document may be identified and the categories associated with those codes may be determined. A user interface may be presented to the user that presents the determined categories and allows a user to select which categories the user finds in class for a given purpose. The selected categories may be utilized as a feature for training the classification models. Additionally, or alternatively, the platform may determine the CPC codes for documents marked as in class and may train the classification models to compare those CPC codes with the CPC codes associated with the documents to be analyzed to determine classification.
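As one hedged illustration of comparing CPC codes, the sketch below reduces each code to its subclass (the first four characters, such as G06F in “G06F 16/35”) and measures the overlap between a document's subclasses and those of the in class documents with a Jaccard ratio. Comparing at the subclass level and using a Jaccard ratio are choices made for this example only.

```python
def cpc_subclass(code):
    """Return the subclass portion of a CPC code, e.g. "G06F 16/35" -> "G06F"."""
    return code.replace(" ", "")[:4].upper()


def cpc_similarity(document_codes, in_class_codes):
    """Jaccard overlap between two collections of CPC codes, by subclass."""
    doc_subclasses = {cpc_subclass(c) for c in document_codes}
    in_subclasses = {cpc_subclass(c) for c in in_class_codes}
    union = doc_subclasses | in_subclasses
    if not union:
        return 0.0
    return len(doc_subclasses & in_subclasses) / len(union)


if __name__ == "__main__":
    candidate = ["G06F 16/35", "G06N 20/00"]
    in_class = ["G06F 16/93", "G06N 3/08", "H04L 9/32"]
    print(cpc_similarity(candidate, in_class))
```

The resulting ratio could be supplied to a classification model as one feature alongside keyword or vector features.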

At block 806, the process 800 may include predicting classification of a remainder of the first documents utilizing the classification model as trained. For example, utilizing the classification model, as trained, the system may predict the classification of the remainder of the first documents that were not labeled by the user input.

At block 808, the process 800 may include determining whether a confidence value associated with the model satisfies a threshold confidence value. For example, each or some of the predictions for the remainder of the documents may be associated with a confidence value indicating how confident the system is that the classification model accurately determined the classification of a given document. The threshold confidence value may be determined and the system may determine whether an overall confidence value associated with the classification model satisfies that threshold confidence value.

In examples where the confidence value does not satisfy the threshold confidence value, at block 810, the process 800 may include requesting additional user input data. For example, in instances where the confidence value does not satisfy the threshold confidence value, the system may cause an indication of this determination to be displayed and may request additional user input data for retraining the classification model.

In examples where the confidence value satisfies the threshold confidence value, at block 812, the process 800 may include receiving second documents for classification prediction. For example, in instances where the confidence value satisfies the threshold confidence value, the system may receive second documents for classification prediction. The second documents may be received based at least in part on a user uploading additional documents and/or from the system retrieving additional documents from one or more databases.
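The branch between blocks 810 and 812 may be sketched as follows. This is a minimal illustration only: the function name `next_step`, the use of a mean to form the overall confidence value, and the default threshold of 0.8 are assumptions made for the example and are not specified by the process 800.

```python
from statistics import mean

def next_step(prediction_confidences, threshold=0.8):
    """Decide which block follows the confidence check at block 808.

    prediction_confidences: per-document confidence values in [0, 1].
    threshold: hypothetical threshold confidence value.
    """
    if not prediction_confidences:
        # With no predictions there is nothing to evaluate, so ask for labels.
        return "request_additional_user_input"
    # Assumption: the overall confidence is the mean of per-document values.
    overall = mean(prediction_confidences)
    if overall >= threshold:
        return "receive_second_documents"      # block 812
    return "request_additional_user_input"     # block 810
```

Other aggregations, such as a minimum or a percentile of the per-document confidence values, could be substituted for the mean without changing the control flow.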

At block 814, the process 800 may include predicting the classification of the second documents utilizing the classification model. For example, the classification model may be utilized to determine whether some or all of the documents in the second document set are more closely related to the in class documents than to the out of class documents.

FIG. 9 illustrates a flow diagram of an example process 900 for generating a positive dataset of keywords and a negative dataset of keywords for classification model training. The order in which the operations or steps are described is not intended to be construed as a limitation, and any number of the described operations may be combined in any order and/or in parallel to implement process 900. The operations described with respect to the process 900 are described as being performed by a client device, and/or a system associated with the document analysis platform. However, it should be understood that some or all of these operations may be performed by some or all of the components, devices, and/or systems described herein.

At block 902, the process 900 may include receiving documents. For example, the document analysis platform may be configured to receive data representing documents. The documents may be provided to the document analysis platform by users uploading those documents and/or by the document analysis platform fetching or otherwise retrieving those documents.

At block 904, the process 900 may include generating document data from the documents. For example, the document analysis platform may be configured to identify portions of the documents and tag or otherwise separate those portions. For example, when the documents are patents and patent applications, the document analysis platform may be configured to identify portions of the documents corresponding to the title, the publication number, the abstract, the detailed description, the figures, and/or the claims. This may be performed utilizing keyword recognition and/or based on one or more rules associated with the formatting of the document, such as numbered paragraphs corresponding to claims, a format for a publication number, presence of image data, etc.
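The keyword recognition and formatting rules mentioned above may be sketched, under simplifying assumptions, as follows. The heading list, the function names `split_sections` and `split_claims`, and the rule that a claim begins with a number followed by a period are all hypothetical choices for illustration; real documents may require different or additional rules.

```python
import re

# Hypothetical headings recognized by keyword; text before the first
# recognized heading is treated as the title.
SECTION_HEADINGS = ("ABSTRACT", "DETAILED DESCRIPTION", "CLAIMS")

# Formatting rule (assumed): a claim starts with "<number>. ".
CLAIM_START = re.compile(r"^(\d+)\.\s+(.*)")

def split_sections(text):
    """Tag each non-empty line with the most recent recognized heading."""
    sections = {}
    current = "TITLE"
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.upper() in SECTION_HEADINGS:
            current = stripped.upper()
            continue
        if stripped:
            sections.setdefault(current, []).append(stripped)
    return sections

def split_claims(claim_lines):
    """Group claim lines into numbered claims, joining continuation lines."""
    claims = {}
    number = None
    for line in claim_lines:
        match = CLAIM_START.match(line)
        if match:
            number = int(match.group(1))
            claims[number] = match.group(2)
        elif number is not None:
            claims[number] += " " + line
    return claims
```

For example, calling `split_sections` on a plain-text document and then passing the lines stored under `"CLAIMS"` to `split_claims` yields a mapping from claim number to claim text.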

At block 906, the process 900 may include displaying the document data and classification voting functionality. For example, a summary user interface may include portions of the documents and information associated with the documents, such as whether the documents have been labeled, a prediction made in association with the document, a confidence value associated with the prediction, and/or an evaluation of the document and/or a portion of the document. In the example where the documents are patents and patent applications, the user interface may display portions of the documents such as the publication number, the title, the abstract, one or more claims, and/or a claim score. The claim score may be based at least in part on an analysis of the claims of a given patent and the claim score may be provided by way of a scale, such as a scale from 0 to 5, where 0 represents the broadest claim score and 5 represents the narrowest claim score. The user interface may also provide some information about the project category associated with the documents, such as the category title, a category progress indicating how many documents have been labeled and/or predicted to be in class and out of class, how many documents have been skipped, and/or a total number of uploaded documents. The user interface may also provide options for viewing the document summaries, such as a filter option to filter the summaries based at least in part on one or more of the attributes of the documents and/or the analysis of the documents. The options may also include a sort option, which may be utilized to sort the summaries based at least in part on one or more of the attributes of the documents and/or the analysis of the documents. The options may also include a columns option, which may be utilized to remove or add columns of information to the summaries. The options may also include an action option, which may be utilized to take an action in association with a document, such as tagging a document, removing a document, editing a document, etc. In addition, the user interface may include selectable portions associated with some or each of the document summaries that, when selected, may cause another user interface to display the full document associated with the selected portion.

At block 908, the process 900 may include receiving user input data associated with classification voting functionality. For example, the user may provide user input utilizing the user interface 350 as described with respect to FIG. 3B.

At block 910, the process 900 may include generating a positive training dataset indicating in class keywords. For example, the platform may generate a positive training dataset indicating in class keywords associated with the documents marked in class by a user. For example, the platform may determine one or more keywords associated with a given document that represent the subject matter of that document. This may be performed utilizing one or more document processing techniques, such as term frequency inverse document frequency techniques, for example.

At block 912, the process 900 may include generating a negative training dataset indicating out of class keywords. For example, the platform may also generate a negative training dataset indicating keywords from the documents marked out of class by the user input.
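Blocks 910 and 912 may be illustrated with a small term frequency inverse document frequency sketch. The tokenizer, the particular weighting formula (term frequency multiplied by `log(N / df) + 1`), the tie-breaking rule, and the names `top_keywords` and `build_keyword_datasets` are assumptions for the example; the platform may use a different variant of the technique.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def top_keywords(documents, k=3):
    """Return the k highest-scoring TF-IDF terms for each document."""
    tokenized = [tokenize(doc) for doc in documents]
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(documents)
    keywords = []
    for tokens in tokenized:
        if not tokens:
            keywords.append([])
            continue
        counts = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * (math.log(n_docs / doc_freq[term]) + 1.0)
            for term, count in counts.items()
        }
        # Sort by descending score, then alphabetically for determinism.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        keywords.append(ranked[:k])
    return keywords

def build_keyword_datasets(documents, labels, k=3):
    """Collect keywords of "in" documents and "out" documents separately."""
    positive, negative = set(), set()
    for terms, label in zip(top_keywords(documents, k), labels):
        (positive if label == "in" else negative).update(terms)
    return positive, negative
```

Here `labels` is assumed to hold one user vote, `"in"` or `"out"`, per document, in the same order as `documents`.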

At block 914, the process 900 may include training a classification model with the positive training dataset and/or with the negative training dataset. For example, each of these training datasets may be utilized to train the classification model such that the classification model is configured to determine whether a given document has keywords that are more similar to the in class keywords than to the out of class keywords.

At block 916, the process 900 may include predicting classification of documents utilizing the classification model. For example, the trained classification model may be configured to intake a given document and determine one or more keywords associated with that document. Those sample keywords may be compared to the keywords from the classification model to determine whether the sample keywords are more closely related to the in class keywords or the out of class keywords from the training datasets.

FIG. 10 illustrates a flow diagram of an example process 1000 for generating a positive dataset of vectors and a negative dataset of vectors for classification model training. The order in which the operations or steps are described is not intended to be construed as a limitation, and any number of the described operations may be combined in any order and/or in parallel to implement process 1000. The operations described with respect to the process 1000 are described as being performed by a client device, and/or a system associated with the document analysis platform. However, it should be understood that some or all of these operations may be performed by some or all of the components, devices, and/or systems described herein.

At block 1002, the process 1000 may include receiving documents. For example, the document analysis platform may be configured to receive data representing documents. The documents may be provided to the document analysis platform by users uploading those documents and/or by the document analysis platform fetching or otherwise retrieving those documents.

At block 1004, the process 1000 may include generating document data from the documents. For example, the document analysis platform may be configured to identify portions of the documents and tag or otherwise separate those portions. For example, when the documents are patents and patent applications, the document analysis platform may be configured to identify portions of the documents corresponding to the title, the publication number, the abstract, the detailed description, the figures, and/or the claims. This may be performed utilizing keyword recognition and/or based on one or more rules associated with the formatting of the document, such as numbered paragraphs corresponding to claims, a format for a publication number, presence of image data, etc.

At block 1006, the process 1000 may include displaying the document data and classification voting functionality. For example, a summary user interface may include portions of the documents and information associated with the documents, such as whether the documents have been labeled, a prediction made in association with the document, a confidence value associated with the prediction, and/or an evaluation of the document and/or a portion of the document. In the example where the documents are patents and patent applications, the user interface may display portions of the documents such as the publication number, the title, the abstract, one or more claims, and/or a claim score. The claim score may be based at least in part on an analysis of the claims of a given patent and the claim score may be provided by way of a scale, such as a scale from 0 to 5, where 0 represents the broadest claim score and 5 represents the narrowest claim score. The user interface may also provide some information about the project category associated with the documents, such as the category title, a category progress indicating how many documents have been labeled and/or predicted to be in class and out of class, how many documents have been skipped, and/or a total number of uploaded documents. The user interface may also provide options for viewing the document summaries, such as a filter option to filter the summaries based at least in part on one or more of the attributes of the documents and/or the analysis of the documents. The options may also include a sort option, which may be utilized to sort the summaries based at least in part on one or more of the attributes of the documents and/or the analysis of the documents. The options may also include a columns option, which may be utilized to remove or add columns of information to the summaries. The options may also include an action option, which may be utilized to take an action in association with a document, such as tagging a document, removing a document, editing a document, etc. In addition, the user interface may include selectable portions associated with some or each of the document summaries that, when selected, may cause another user interface to display the full document associated with the selected portion.

At block 1008, the process 1000 may include receiving user input data associated with classification voting functionality. For example, the user may provide user input utilizing the user interface 350 as described with respect to FIG. 3B.

At block 1010, the process 1000 may include generating a positive training dataset indicating positive vectors. For example, the platform may generate a positive training dataset indicating that a vectorized representation of a document is marked in class by a user. For example, the platform may determine a vector in a coordinate system that represents the subject matter of a document. This may be performed utilizing one or more document processing techniques, such as Doc2Vec, for example.

At block 1012, the process 1000 may include generating a negative training dataset indicating negative vectors. For example, the platform may also generate a negative training dataset indicating vectors representing documents marked out of class by the user input.

At block 1014, the process 1000 may include training a classification model with the positive training dataset and/or with the negative training dataset. For example, each of these training datasets may be utilized to train the classification model such that the classification model is configured to determine whether a sample vector representing a given document is closer, in the coordinate system, to the positive vectors than to the negative vectors.

At block 1016, the process 1000 may include predicting classification of documents utilizing the classification model. For example, the trained classification model may be configured to intake a given document and determine a vector representing that document. That sample vector may be compared to the positive and negative vectors to determine whether the sample vector is more closely related to the positive vectors or the negative vectors from the training datasets.
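The vector comparison of blocks 1014 and 1016 may be sketched as a nearest-centroid rule. This is one possible reading only: the embedding step (for example, Doc2Vec) is omitted and the vectors are assumed to be given, and comparing Euclidean distances to the mean positive and mean negative vectors is an assumption rather than the method prescribed by the process 1000.

```python
import math

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(vec[i] for vec in vectors) / count for i in range(len(vectors[0]))]

def predict(sample_vector, positive_vectors, negative_vectors):
    """Label the sample "in" when it is closer to the positive centroid."""
    distance_to_positive = math.dist(sample_vector, centroid(positive_vectors))
    distance_to_negative = math.dist(sample_vector, centroid(negative_vectors))
    # Ties fall to "out", mirroring a strict "closer than" comparison.
    return "in" if distance_to_positive < distance_to_negative else "out"
```

A cosine-based comparison or a nearest-neighbor vote could be substituted for the centroid distance where the embedding space makes those measures more meaningful.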

FIG. 11 illustrates a conceptual diagram of an example process 1100 for utilizing document categorization for receiving user input data and/or for training classification models. FIG. 11 illustrates a progression, from left to right and top to bottom, of information displayed on and/or interactions with one or more user interfaces.

In addition to the techniques for training the classification models described herein associated with receiving user input indicating whether documents are in class or out of class, the classification models may also be trained and/or organized based at least in part on classifications of the documents 102. For example, when the documents 102 are patents and patent applications, a predetermined classification system may be established for classifying the subject matter of a given document 102. The classification system may be determined by the platform, by one or more users, and/or by a third party. For example, patents and patent applications may be associated with a predefined classification system such as the Cooperative Patent Classification (CPC) system. The CPC system employs CPC codes 1102 that correspond to differing subject matter. The CPC codes 1102 for a given document 102 may be identified and the categories associated with those codes 1102 may be determined. For example, using FIG. 11, the CPC codes 1102 may include one or more numbers and/or letters that have been predefined to correspond to various document categories and/or subcategories. For certain documents 102, multiple codes 1102 may be identified and/or utilized. For example, the CPC codes 1102 in FIG. 11 include international classification codes and United States classification codes. The specific international classification code in FIG. 11 is F21V. This CPC code may correspond to one or more categories and/or subcategories, with “mechanical engineering; lighting; heating; weapons; blasting engines or pumps” being a category corresponding to the “F” portion of the code. “Lighting” may correspond to the “21” portion of the code. And “V” may correspond to the subcategory of “functional features or details of lighting devices or systems thereof; structural combination of lighting devices with other articles, not otherwise provided for.”
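The decomposition of a code such as F21V into its portions may be sketched as follows. The names `CpcParts` and `parse_cpc` are hypothetical, and the sketch handles only the leading letter, two digits, and letter described above; any further group or subgroup text following those four characters is ignored.

```python
import re
from typing import NamedTuple

class CpcParts(NamedTuple):
    section: str   # e.g. "F"
    cpc_class: str  # e.g. "21"
    subclass: str  # e.g. "V"

# One letter, two digits, one letter at the start of the code.
CPC_PREFIX = re.compile(r"^([A-Z])(\d{2})([A-Z])")

def parse_cpc(code):
    """Split the leading portion of a CPC code into its three parts."""
    match = CPC_PREFIX.match(code.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized CPC code: {code!r}")
    return CpcParts(*match.groups())
```

Each returned part may then be looked up in a table of category titles so that the corresponding categories and subcategories can be presented for selection.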

A user interface 1104 may be presented to the user with the determined categories and the user interface 1104 may allow a user to select which categories the user finds in class for a given purpose. As shown in FIG. 11, a user may select one or more of the categories and/or subcategories, such as “mechanical engineering,” “lighting devices,” “light distribution,” “portable lights,” etc. The selected categories may be utilized as a feature for training the classification models. Additionally, or alternatively, the platform may determine the CPC codes 1102 for documents 102 marked as in class and may train the classification models to compare those CPC codes 1102 with the CPC codes 1102 associated with the documents 102 to be analyzed to determine classification. Additionally, determination of the CPC codes 1102 may be utilized to search for and acquire additional documents to be utilized for training a classification model and/or for utilizing the model to determine classification of such documents. In these examples, the additional documents may be provided to the model builder component 228, which may utilize those documents as described herein.

In addition to the above, CPC codes 1102 and/or other document classification systems may be utilized to determine which documents will be predicted as in class and which documents will be predicted as out of class. For example, as described above, keywords associated with in class documents may be utilized for comparison with sample keywords to determine if a sample document should be classified as in class. In addition to such keywords, the CPC codes 1102 may also be utilized. For example, a given document having keywords and a first CPC code 1102 may be predicted as in class, while another document having the same keywords but having a second CPC code 1102 may be predicted as out of class. By so doing, the classification models may be trained to utilize the CPC codes 1102 as an indicator of whether a document should be classified as in class or out of class.

FIG. 12 illustrates a conceptual diagram of an example process 1200 for receiving user input data for training classification models. FIG. 12 illustrates a progression, from left to right and top to bottom, of information displayed on and/or interactions with one or more user interfaces.

In examples, a full document user interface may include information about documents being reviewed by a user, such as the document title, publication number, abstract, claims, and category notes such as the number of documents marked in class and out of class, the number of documents skipped, the number of documents that have been labeled, and analysis details of the document. The user interface may provide additional information regarding some or all of the aspects of a given document. For example, additional portions of the abstract and/or additional claims and/or claim language may be displayed. Additionally, the category progress information and analysis details may be displayed in a category notes window. The analysis details may include the prediction made with respect to the document, such as whether a classification model determined that the document was in class or out of class, a confidence value associated with that determination, and a claim score associated with the claims of the document.

In addition to the above, the user interface may provide a voting window 354 that may allow a user to provide user input indicating whether the document should be labeled as relevant or otherwise “in class” or irrelevant or otherwise “out of class.” Additional options may include “skip” and “undo” for example. The voting window 354 may also be utilized to present one or more of the keywords to enable “hotkeys” or otherwise shortcut keys to allow for the user input via a keyboard or similar device as opposed to a mouse scrolling and clicking one of the options, and an option to utilize uncertainty sampling. For example, a user may view the information about the document in the user interface. After review of some or all of the information being displayed, the user may determine that the document is either in class or out of class (or determine that the document is to be skipped). In examples where the document is to be labeled as in class, the user may utilize one or more input means to select a portion of the screen corresponding to the “in” option. In examples where the document is to be labeled as out of class, the user may utilize one or more input means to select a portion of the screen corresponding to the “out” option. Alternatively, when hotkeys are enabled, the user may select the corresponding hotkey on a keyboard (whether physical or digital). Upon selection of one of the options in the voting window 354, the user interface may be caused to display the next unlabeled document in the document set to allow the user to review that document and provide user input associated with the classification of that document.

As shown in FIG. 12, when a user selects the “in” portion of the user interface and/or otherwise indicates that the given document is in class, that document and/or a feature and/or attribute of that document may be saved to a positive dataset 1202. For example, when the models utilize keywords for document comparison as described herein, keywords associated with the document labeled “in” may be stored in association with the positive dataset 1202, along with additional information such as weighting values associated with the keywords and/or confidence values associated with the determination of the keywords. In examples where the models utilize vectors for document comparison as described herein, a vector associated with the document labeled “in” may be stored in association with the positive dataset 1202, along with additional information such as weighting values and/or confidence values. Additional documents where the user indicates that the documents are in class may also be stored in association with the positive dataset 1202.

When a user selects the “out” portion of the user interface and/or otherwise indicates that the given document is out of class, that document and/or a feature and/or attribute of that document may be saved to a negative dataset 1204. For example, when the models utilize keywords for document comparison as described herein, keywords associated with the document labeled “out” may be stored in association with the negative dataset 1204, along with additional information such as weighting values associated with the keywords and/or confidence values associated with the determination of the keywords. In examples where the models utilize vectors for document comparison as described herein, a vector associated with the document labeled “out” may be stored in association with the negative dataset 1204, along with additional information such as weighting values and/or confidence values. Additional documents where the user indicates that the documents are out of class may also be stored in association with the negative dataset 1204.

As described more fully herein, the classification model may be trained utilizing the labeled documents. For example, the datasets 1202, 1204 may be utilized to train the classification model to identify, for the documents that have not been labeled, whether a given document is in class or out of class. To do so, the datasets 1202, 1204 may be utilized by the model builder component 228 to train the classification model to compare in class and out of class keywords with keywords representative of a sample document, and/or to compare in class and out of class vectors with a vector representative of the sample document.
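The routing of votes into the positive dataset 1202 and the negative dataset 1204 may be sketched as a small data structure. The class name `LabeledDatasets`, the `record_vote` method, and the representation of a document's features as a mapping from keyword to weighting value are assumptions for the example; a vector could be stored in place of the keyword mapping.

```python
from dataclasses import dataclass, field

@dataclass
class LabeledDatasets:
    """Holds features of documents voted "in", voted "out", or skipped."""
    positive: list = field(default_factory=list)  # dataset 1202
    negative: list = field(default_factory=list)  # dataset 1204
    skipped: list = field(default_factory=list)

    def record_vote(self, document_id, features, vote):
        """Store a document's features according to the user's vote."""
        entry = {"document_id": document_id, "features": features}
        if vote == "in":
            self.positive.append(entry)
        elif vote == "out":
            self.negative.append(entry)
        elif vote == "skip":
            self.skipped.append(entry)
        else:
            raise ValueError(f"unknown vote: {vote!r}")

    def progress(self):
        """Counts suitable for a category progress display."""
        return {
            "in": len(self.positive),
            "out": len(self.negative),
            "skipped": len(self.skipped),
        }
```

The two labeled lists may then be handed to whatever training routine the model builder component applies.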

FIGS. 13 and 14 illustrate processes associated with document analysis platforms. The processes described herein are illustrated as collections of blocks in logical flow diagrams, which represent a sequence of operations, some or all of which may be implemented in hardware, software or a combination thereof. In the context of software, the blocks may represent computer-executable instructions stored on one or more computer-readable media that, when executed by one or more processors, program the processors to perform the recited operations. Generally, computer-executable instructions include routines, programs, objects, components, data structures and the like that perform particular functions or implement particular data types. The order in which the blocks are described should not be construed as a limitation, unless specifically noted. Any number of the described blocks may be combined in any order and/or in parallel to implement the process, or alternative processes, and not all of the blocks need be executed. For discussion purposes, the processes are described with reference to the environments, architectures and systems described in the examples herein, such as, for example those described with respect to FIGS. 1-12 and 15-30, although the processes may be implemented in a wide variety of other environments, architectures and systems.

FIG. 13 illustrates a flow diagram of an example process 1300 for utilizing a classification model to determine whether a given document is in class or out of class. The order in which the operations or steps are described is not intended to be construed as a limitation, and any number of the described operations may be combined in any order and/or in parallel to implement process 1300. The operations described with respect to the process 1300 are described as being performed by a client device, and/or a system associated with the document analysis platform. However, it should be understood that some or all of these operations may be performed by some or all of the components, devices, and/or systems described herein.

At block 1302, the process 1300 may include selecting a classification model. For example, one or more models for predicting the classification of documents in a document set may be made available for use. Those models may include one or more models that utilize predictive analytics to predict one or more outcomes. Predictive analytic techniques may include, for example, predictive modelling, machine learning, and/or data mining. Generally, predictive modelling may utilize statistics to predict outcomes. Machine learning, while also utilizing statistical techniques, may provide the ability to improve outcome prediction performance without being explicitly programmed to do so. A number of machine learning techniques may be employed to generate and/or modify the layers and/or models described herein. Those techniques may include, for example, decision tree learning, association rule learning, artificial neural networks (including, in examples, deep learning), inductive logic programming, support vector machines, clustering, Bayesian networks, reinforcement learning, representation learning, similarity and metric learning, sparse dictionary learning, and/or rules-based machine learning.

Information from stored and/or accessible data may be extracted from one or more databases, and may be utilized to predict trends and behavior patterns. The predictive analytic techniques may be utilized to determine associations and/or relationships between explanatory variables and predicted variables from past occurrences and to utilize these variables to predict the unknown outcome. The predictive analytic techniques may include defining the outcome and data sets used to predict the outcome.

Data analysis may include using one or more models, including for example one or more algorithms, to inspect the data with the goal of identifying useful information and arriving at one or more determinations that assist in predicting the outcome of interest. One or more validation operations may be performed, such as using statistical analysis techniques, to validate accuracy of the models. Thereafter, predictive modelling may be performed to generate accurate predictive models.

At block 1304, the process 1300 may include determining a first similarity value of sample features to reference features indicating documents that are in class. For example, when keywords are utilized to represent documents, keywords associated with the documents labelled as in class from user input may be compared to keywords of a sample document. When those keywords correspond to each other well, or otherwise the reference documents and the sample document share keywords, particularly keywords that are heavily weighted as being in class, then the first similarity value may be high. When the keywords of the reference documents and the keywords of the sample document do not correlate well, then the first similarity value may be low. In examples where vectors are utilized, vectors associated with the documents labelled as in class from user input may be compared to a vector of a sample document. When that vector is closer in distance to vectors associated with in class documents than to out of class documents, the first similarity value may be high.

At block 1306, the process 1300 may include determining a second similarity value of the sample features to reference features indicating documents that are out of class. Determining the second similarity value may be performed in the same or a similar manner as determining the first similarity value. However, instead of determining how closely a sample document correlates to documents determined to be in class, the second similarity value indicates how closely a sample document is correlated to documents marked out of class by user input.

At block 1308, the process 1300 may include determining whether the first similarity value is greater than the second similarity value. For example, the similarity values may be compared to determine which value is greater, assuming a scale where greater equates to higher confidence.

In examples where the first similarity value is not greater than the second similarity value, then at block 1310 the process 1300 may include determining that the document is out of class. The document may be marked as out of class and counted as “out” for purposes of display to a user. A confidence value, which may indicate how closely the document correlated to out of class documents labeled as such by a user, may also be determined.

In examples where the first similarity value is greater than the second similarity value, then at block 1312 the process 1300 may include determining that the document is in class. The document may be marked as in class and counted as “in” for purposes of display to a user. A confidence value, which may indicate how closely the document correlated to in class documents labeled as such by a user, may also be determined.
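Blocks 1304 through 1312 may be sketched for the keyword case as follows. The similarity measure used here, the share of reference keyword weight that the sample document matches, is an assumption chosen for brevity, as are the function names and the choice to report the winning similarity value as the confidence value.

```python
def weighted_similarity(sample_keywords, reference_weights):
    """Fraction of total reference weight matched by the sample keywords."""
    total = sum(reference_weights.values())
    if total == 0:
        return 0.0
    matched = sum(
        weight for keyword, weight in reference_weights.items()
        if keyword in sample_keywords
    )
    return matched / total

def classify(sample_keywords, in_class_weights, out_of_class_weights):
    """Return (label, confidence) by comparing the two similarity values."""
    first = weighted_similarity(sample_keywords, in_class_weights)       # block 1304
    second = weighted_similarity(sample_keywords, out_of_class_weights)  # block 1306
    if first > second:                                                   # block 1308
        return "in", first                                               # block 1312
    return "out", second                                                 # block 1310
```

Because the comparison is strict, a tie between the two similarity values results in an out of class determination, consistent with block 1310 applying when the first value is not greater than the second.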

FIG. 14 illustrates a flow diagram of an example process 1400 for determining a labelling influence value associated with receiving user input data to retrain a classification model. The order in which the operations or steps are described is not intended to be construed as a limitation, and any number of the described operations may be combined in any order and/or in parallel to implement process 1400. The operations described with respect to the process 1400 are described as being performed by a client device, and/or a system associated with the document analysis platform. However, it should be understood that some or all of these operations may be performed by some or all of the components, devices, and/or systems described herein.

At block 1402, the process 1400 may include training a classification model based at least in part on first user input data associated with a first set of documents. For example, the process 1300 as described above may be utilized to train the classification model. In part, one or more datasets may be generated based at least in part on user input indicating a portion of documents that are in class and/or a portion of documents that are out of class. The dataset(s) may be utilized to train a classification model to determine whether features of a sample document correlate more to the documents labeled as in class or the documents labeled as out of class.

At block 1404, the process 1400 may include receiving second user input data associated with a second set of documents. For example, documents that had not been previously labeled by the user may be labeled. The second set of documents may be a portion of the original set of documents and/or the second set of documents may include newly-added documents.

At block 1406, the process 1400 may include retraining the classification model based at least in part on the second user input data. Retraining the classification model may be performed in the same or a similar manner as training the classification model as described at block 1402.

At block 1408, the process 1400 may include determining a difference in a number of documents determined to be in class by the retrained classification model as compared to the classification model before retraining. For example, before retraining, the classification model may be run to determine a number of the documents predicted to be in class and a number of the documents predicted to be out of class. The classification model may be run again after retraining and a second number of documents predicted to be in class and documents predicted to be out of class may be determined. For example, having retrained the model utilizing new labeling data, in examples the model will perform differently than before retraining, and that difference may result in some of the documents originally predicted to be in class being predicted as out of class after retraining, and/or vice versa.

At block 1410, the process 1400 may include generating a labeling influence value indicating a degree of influence on the classification model by the second user input. For example, when the number of in class and out of class documents changes drastically between model trainings, the labeling influence value may indicate this change. This may provide a user with an indication that additional user input will have a large effect on the model. In other examples, when the number of in class and out of class documents does not change or changes only slightly, the labeling influence value may indicate this small change. This may provide a user with an indication that additional user input will have negligible effects on the model.
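As an illustration of blocks 1408 and 1410, the following minimal Python sketch computes one possible labeling influence value. The description does not fix a formula, so the normalization by the total number of documents, the function names, and the NEGLIGIBLE cutoff are all assumptions made for this example.

```python
from typing import Sequence


def labeling_influence(before: Sequence[bool], after: Sequence[bool]) -> float:
    """Return the change in the in-class count as a fraction of all documents.

    `before` and `after` hold one prediction per document (True = in class),
    taken before and after retraining on the second user input.
    """
    if len(before) != len(after):
        raise ValueError("both runs must cover the same documents")
    if not before:
        return 0.0
    in_before = sum(before)  # number predicted in class before retraining
    in_after = sum(after)    # number predicted in class after retraining
    return abs(in_after - in_before) / len(before)


# Hypothetical cutoff: below this, further labeling is treated as negligible.
NEGLIGIBLE = 0.05


def more_labeling_suggested(value: float) -> bool:
    """True when the last round of labels changed the model's output noticeably."""
    return value >= NEGLIGIBLE
```

Because this sketch compares only the net in-class counts, a document that moves into class can cancel one that moves out of class. A variant that counts every changed prediction would capture such swaps.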

FIG. 15 illustrates a conceptual diagram of an example model taxonomy 1500. For example, in addition to the training of classification models as described above, once the classification models are trained such that the models are determined to accurately predict classification as trained, the models may be placed in a model taxonomy 1500. The model taxonomy 1500 may represent a taxonomy tree or otherwise a model hierarchy indicating relationships between models and/or a level of specificity associated with the models. For example, as shown in FIG. 15, a first model 1502 associated with determining whether documents are in class with respect to “computers,” may be associated with other models 1505, 1510, 1516 trained to determine whether documents are in class with respect to “processors,” “memory,” and “keyboards,” respectively. Each of these models may also be associated with other models 1506, 1508, 1512, 1514, 1518, 1520 trained to determine more specific aspects of these components, such as “microprocessors” and “processor components,” or “RAM” and “partitioned memory.” This taxonomy 1500 may be searchable and may provide functionality that allows a user to provide a search query for a model. The keywords from the search query may be utilized to identify models that may be applicable to the search query and/or to highlight “branches” of the taxonomy associated with the search query.

As shown in FIG. 15, the models in the model taxonomy 1500 may be linked to each other in one or more ways. For example, when the subject matter of one model is related to the subject matter of another model, those models may be linked in the taxonomy 1500. In some examples, the nodes of the taxonomy representing the models may be determined utilizing a predefined subject matter classification system, such as the CPC system described herein.
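One way to picture a searchable taxonomy such as the one in FIG. 15 is as a tree of model nodes whose titles are matched against query keywords. The Python sketch below is illustrative only: the ModelNode class, the find_branches function, and the substring matching rule are assumptions, and only the node titles named in the description of FIG. 15 are used.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ModelNode:
    """A node of the taxonomy; a real node would also reference its trained model."""
    title: str
    children: List["ModelNode"] = field(default_factory=list)


def find_branches(node: ModelNode, query: str, path: Tuple[str, ...] = ()) -> list:
    """Return the root-to-node path ("branch") of every node matching the query."""
    path = path + (node.title,)
    words = query.lower().split()
    # Naive match: any query word appearing inside the node title.
    hits = [path] if any(w in node.title.lower() for w in words) else []
    for child in node.children:
        hits.extend(find_branches(child, query, path))
    return hits


taxonomy = ModelNode("computers", [
    ModelNode("processors", [ModelNode("microprocessors"),
                             ModelNode("processor components")]),
    ModelNode("memory", [ModelNode("RAM"), ModelNode("partitioned memory")]),
    ModelNode("keyboards"),
])
```

Calling find_branches(taxonomy, "memory") returns the branch ending at “memory” and the branch ending at “partitioned memory,” which a user interface could then highlight.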

FIG. 16 illustrates processes associated with document analysis platforms. The processes described herein are illustrated as collections of blocks in logical flow diagrams, which represent a sequence of operations, some or all of which may be implemented in hardware, software or a combination thereof. In the context of software, the blocks may represent computer-executable instructions stored on one or more computer-readable media that, when executed by one or more processors, program the processors to perform the recited operations. Generally, computer-executable instructions include routines, programs, objects, components, data structures and the like that perform particular functions or implement particular data types. The order in which the blocks are described should not be construed as a limitation, unless specifically noted. Any number of the described blocks may be combined in any order and/or in parallel to implement the process, or alternative processes, and not all of the blocks need be executed. For discussion purposes, the processes are described with reference to the environments, architectures and systems described in the examples herein, such as, for example those described with respect to FIGS. 1-15 and 17-30, although the processes may be implemented in a wide variety of other environments, architectures and systems.

FIG. 16 illustrates a flow diagram of an example process 1600 for determining portions of a classification model associated with confidential information and generating a modified classification model without association to the confidential information. The order in which the operations or steps are described is not intended to be construed as a limitation, and any number of the described operations may be combined in any order and/or in parallel to implement process 1600. The operations described with respect to the process 1600 are described as being performed by a client device, and/or a system associated with the document analysis platform. However, it should be understood that some or all of these operations may be performed by some or all of the components, devices, and/or systems described herein.

At block 1602, the process 1600 may include training a classification model based at least in part on user input data. For example, one or more datasets may be generated based at least in part on user input indicating a portion of documents that are in class and/or a portion of documents that are out of class. The dataset(s) may be utilized to train a classification model to determine whether features of a sample document correlate more to the documents labeled as in class or the documents labeled as out of class.

At block 1604, the process 1600 may include determining whether the user input data and/or other data indicates confidential information. For example, when a user profile associated with a user is set up and/or when a project is started, information about the purpose of the project, one or more restrictions put on the project, and/or user input provided for the project may be provided. In examples, that information may indicate that the project and/or portions thereof are confidential and/or are otherwise restricted from use by other users. For example, a company may desire to utilize the document analysis platform described herein for one or more business reasons, but in generating a classification model to serve those purposes, the user input provided and/or the results from the classification model may indicate the user's purpose. When that purpose is confidential, an indication of such may be provided by the user and/or may be inferred by the platform. For example, at least a portion of the user input utilized to train the classification model in question may be denoted as confidential by the user.

In examples where the user input data and/or the other data does not indicate confidential information, then at block 1606 the process 1600 may include publishing the classification model to a model taxonomy. For example, the classification model may be deemed sufficient for publishing to the model taxonomy such that use by others as described elsewhere herein may not impact confidentiality and/or restrictions associated with the user that was involved in training the model.

In examples where the user input data and/or the other data indicates confidential information, then at block 1608 the process 1600 may include generating a modified classification model without use of confidential information. For example, the user input data associated with confidential information and/or the restrictions may be removed from the dataset(s) utilized for training the classification model. The classification model may be retrained without that confidential user input, resulting in a modified classification model.

At block 1610, the process 1600 may include publishing the modified classification model to the model taxonomy. The modified classification model may be published in the same or a similar manner as described above with respect to block 1606. In examples, when a classification model has been modified pursuant to this process, an indication of the modification may be provided to other users of the model. This may provide those users with an indication that additional training of the model may be desirable.
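The branch between blocks 1606 and 1608 can be sketched in Python as follows. This is a simplified illustration, not the platform's implementation: the Label record, the confidential flag, and the train callable passed in by the caller are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Label:
    doc_id: str
    in_class: bool
    confidential: bool = False  # set by the user or inferred by the platform


def model_for_publication(model: object,
                          labels: List[Label],
                          train: Callable[[List[Label]], object]) -> Tuple[object, bool]:
    """Return (model to publish, whether it was modified).

    If no label is confidential, the original model is published as is
    (block 1606). Otherwise the confidential labels are dropped and the
    model is retrained on the remainder (block 1608).
    """
    if not any(label.confidential for label in labels):
        return model, False
    public_labels = [label for label in labels if not label.confidential]
    return train(public_labels), True
```

The returned flag could drive the indication, described at block 1610, that a published model was modified and may benefit from additional training.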

FIG. 17 illustrates a conceptual diagram of an example process 1700 for presenting at least a portion of a model taxonomy 1500 based at least in part on a user query and utilizing a selected model for document analysis. FIG. 17 illustrates a progression, from left to right and top to bottom, of information displayed on and/or interactions with one or more user interfaces.

For example, a user interface 1702 may be generated and/or displayed to allow for user input to be received for searching the model taxonomy 1500 for models that may be utilized by a user for determining classification. The user interface 1702 may be configured to receive user input representing search terms for searching for models. These search terms may collectively be referred to as a search query. The taxonomy may be searchable and may provide functionality that allows a user to provide the search query for a model. The keywords from the search query may be utilized to identify models that may be applicable to the search query and/or to highlight “branches” of the taxonomy 1500 associated with the search query.

A user interface may be utilized to display indications of the models identified during a model search, and the user interface may be configured to receive user input indicating selection of a given model for use in determining classification of documents. The user and/or the platform may then upload the document set 102 to be analyzed and the selected model 1704 may be utilized to predict classification of individual ones of the documents 102. A user interface indicating the classification predictions 1706 as performed utilizing the selected model 1704 may be displayed as well as a confidence value associated with the accuracy of the model in determining classification. This may provide the user with an indication of whether the selected model 1704 is sufficient for analyzing the documents at issue, or whether another model should be selected, or a new model should be trained.

FIG. 18 illustrates a conceptual diagram of an example model taxonomy 1800 showing gaps in the taxonomy 1800 where classification models have not been trained and/or require further training.

For example, the model taxonomy 1800 may provide an indication of where models have not been trained for a given subject matter. For example, a “node” on the model taxonomy 1800 may be left blank or may indicate that a model has not yet been trained. With respect to FIG. 18, such an indication includes a shading of a node of the taxonomy 1800. Additionally, in examples, the line or other linkage between nodes in the taxonomy may be displayed as dashed or otherwise not complete. The nodes may be determined from the predefined classification system, such as by CPC codes. This may provide a user an indication of whether selecting an already-trained model would be preferable to training a new model. In FIG. 18, the force sensors node 1804, key linkage node 1810, and lighting node 1816 are shown as related to the keyboards node 1802. The force data processing node 1806 is not shaded, indicating that a model has been trained for this subject matter. However, the sensor materials node 1808 and the removeable keys node 1814 are displayed as shaded, and dashed lines connect those nodes to the next tier of nodes in the taxonomy 1800. This may indicate that models have not been trained for that subject matter and/or that models that have been trained do not sufficiently determine classification of documents.

The model taxonomy 1800 may also provide an indication of how closely related models are on the hierarchy. This indication may be presented by way of lines between “nodes” in the taxonomy 1800, where the length of a line may indicate how closely the models are related to each other. With respect to FIG. 18, the key linkage node 1810 is displayed closer to the lighting node 1816 than to the force sensors node 1804. This may indicate that the key linkage model is more closely related to the lighting model than to the force sensors model.

In addition, when a user query for use of a model is received, the model and/or models that most closely match up with the search query may be identified. The platform may determine whether at least one of the resulting models has keywords that are sufficiently similar to the keywords in the search query. In examples where there is sufficient similarity, indicators of those models may be presented as results to the user. In examples where there is insufficient similarity, the user interface may return results indicating that no models in the model taxonomy are sufficient in light of the search query, and may request that the user perform the operations associated with training a new model.

FIG. 19 illustrates a conceptual diagram of an example process 1900 for determining which models in a model taxonomy to present in response to a user query for utilizing a model. FIG. 19 illustrates a progression, from left to right and top to bottom, of information displayed on and/or interactions with one or more user interfaces.

For example, a user interface 1902 may be generated and/or displayed to allow for user input to be received for searching the model taxonomy for models that may be utilized by a user for determining classification. The user interface 1902 may be configured to receive user input representing search terms for searching for models. These search terms may collectively be referred to as a search query. The taxonomy may be searchable and may provide functionality that allows a user to provide the search query for a model. The keywords from the search query may be utilized to identify models that may be applicable to the search query and/or to highlight “branches” of the taxonomy associated with the search query.

As shown in FIG. 19, the search query is “keyboard keys.” This search query may be compared to titles of the models in the taxonomy and/or to tags associated with the models to determine which model or models may correlate to the search query. The tags may be identified via user input that includes the tags and/or may be based on keywords associated with the documents determined to be in class by the model. In FIG. 19, the “keyboard keys” search query may result in the document analysis platform identifying the “keyboards” model 1802, the “key linkage” model 1810, and the “removeable keys” model 1814 as correlating at least in part to the search query. In examples, the platform may determine that a particular model correlates to the search query, but the platform may nonetheless surface that model and other models that are linked or otherwise associated with that model as search results. For example, the model(s) in the taxonomy that are one tier higher and/or one tier lower may also be surfaced. This may allow the user to determine the level of abstraction of the model to be selected.
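The “keyboard keys” example can be sketched in Python as a title match followed by an expansion to adjacent tiers. The sketch rests on several assumptions: the plural-stripping stems helper is a stand-in for whatever keyword normalization the platform uses, and the second-tier links in CHILDREN (including placing “removeable keys” under “key linkage”) are guesses, since the description does not state them.

```python
# Parent -> children links. The first tier follows the description of FIG. 18;
# the second-tier placements are assumptions made for this example.
CHILDREN = {
    "keyboards": ["force sensors", "key linkage", "lighting"],
    "force sensors": ["force data processing", "sensor materials"],
    "key linkage": ["removeable keys"],
}
PARENT = {child: parent for parent, kids in CHILDREN.items() for child in kids}


def stems(text: str) -> set:
    """Lowercase words with a trailing plural 's' removed (a naive stand-in)."""
    return {w[:-1] if w.endswith("s") and len(w) > 3 else w
            for w in text.lower().split()}


def match_models(query: str) -> list:
    """Titles sharing at least one stemmed word with the search query."""
    query_stems = stems(query)
    titles = set(CHILDREN) | set(PARENT)
    return sorted(t for t in titles if query_stems & stems(t))


def with_neighbors(title: str) -> list:
    """The model plus the models one tier higher and one tier lower."""
    surfaced = {title}
    if title in PARENT:
        surfaced.add(PARENT[title])
    surfaced.update(CHILDREN.get(title, []))
    return sorted(surfaced)
```

Under these assumptions, match_models("keyboard keys") identifies the “keyboards,” “key linkage,” and “removeable keys” models, and with_neighbors("keyboards") additionally surfaces the “force sensors” and “lighting” models one tier lower.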

A user interface may be utilized to display indications of the models identified during a model search, and the user interface may be configured to receive user input indicating selection of a given model for use in determining classification of documents. The user and/or the platform may then upload a document set to be analyzed and the selected model may be utilized to predict classification of individual ones of the documents. A user interface indicating the classification predictions as performed utilizing the selected model may be displayed as well as a confidence value associated with the accuracy of the model in determining classification. This may provide the user with an indication of whether the selected model is sufficient for analyzing the documents at issue, or whether another model should be selected, or a new model should be trained.

FIGS. 20-30 illustrate processes associated with document analysis platforms. The processes described herein are illustrated as collections of blocks in logical flow diagrams, which represent a sequence of operations, some or all of which may be implemented in hardware, software or a combination thereof. In the context of software, the blocks may represent computer-executable instructions stored on one or more computer-readable media that, when executed by one or more processors, program the processors to perform the recited operations. Generally, computer-executable instructions include routines, programs, objects, components, data structures and the like that perform particular functions or implement particular data types. The order in which the blocks are described should not be construed as a limitation, unless specifically noted. Any number of the described blocks may be combined in any order and/or in parallel to implement the process, or alternative processes, and not all of the blocks need be executed. For discussion purposes, the processes are described with reference to the environments, architectures and systems described in the examples herein, such as, for example those described with respect to FIGS. 1-19, although the processes may be implemented in a wide variety of other environments, architectures and systems.

FIG. 20 illustrates a flow diagram of an example process 2000 for determining whether to utilize a model from a model taxonomy or whether to request training of a new model. The order in which the operations or steps are described is not intended to be construed as a limitation, and any number of the described operations may be combined in any order and/or in parallel to implement process 2000. The operations described with respect to the process 2000 are described as being performed by a client device, and/or a system associated with the document analysis platform. However, it should be understood that some or all of these operations may be performed by some or all of the components, devices, and/or systems described herein.

At block 2002, the process 2000 may include receiving a search query. For example, a user interface may be generated and/or displayed to allow for user input to be received for searching the model taxonomy for models that may be utilized by a user for determining classification. The user interface may be configured to receive user input representing search terms for searching for models. These search terms may collectively be referred to as a search query. The taxonomy may be searchable and may provide functionality that allows a user to provide the search query for a model. The keywords from the search query may be utilized to identify models that may be applicable to the search query and/or to highlight “branches” of the taxonomy associated with the search query.

At block 2004, the process 2000 may include determining whether a model classification satisfies a threshold similarity to the search query. For example, when the words in the search query are keywords associated with a given model, a model classification value representing the correlation between the search query and the subject matter on which the model was trained may be high and may satisfy a threshold similarity. In other examples, the search query may only partially correlate to the subject matter on which the model was trained. In these examples, the model classification value may not satisfy the threshold similarity to the search query.
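A simple way to picture the decision at block 2004 is sketched below in Python. The scoring rule (the fraction of query words that are keywords of a model), the THRESHOLD value, and the function names are assumptions chosen for illustration; the description does not specify how the model classification value is computed.

```python
from typing import Dict, Iterable, Tuple


def query_match(query: str, model_keywords: Iterable[str]) -> float:
    """Fraction of the query's words that are keywords of the model."""
    query_words = set(query.lower().split())
    if not query_words:
        return 0.0
    keywords = {k.lower() for k in model_keywords}
    return len(query_words & keywords) / len(query_words)


THRESHOLD = 0.75  # hypothetical threshold similarity


def next_step(query: str, models: Dict[str, Iterable[str]]) -> Tuple[str, dict]:
    """Present matching models (block 2006) or request model training (block 2016)."""
    scores = {title: query_match(query, kws) for title, kws in models.items()}
    passing = {title: s for title, s in scores.items() if s >= THRESHOLD}
    if passing:
        return "present_results", passing
    return "request_training", {}
```

With a model tagged with the keywords “keyboard,” “keys,” and “switch,” the query “keyboard keys” scores 1.0 and the model is presented. A query that shares no keywords with any model falls through to the training request.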

In examples where the classification satisfies the threshold similarity, at block 2006, the process 2000 may include presenting indicators of one or more models as search results. For example, a user interface may be caused to display an indicator of the one or more models as search results. The search results may include, for example, the title of the model(s), a confidence value that the model(s) correlate to the search query, and an option to select one or more of the models for use in predicting document classification.

At block 2008, the process 2000 may include receiving a user selection of a model from the one or more models. For example, the user may provide user input via the user interface indicating that the user desires to use a given model of the search results to analyze documents.

At block 2010, the process 2000 may include receiving documents to be analyzed utilizing the selected model. For example, the user may upload documents to the document analysis platform and/or the platform may query one or more databases for the documents. Data representing the documents may be received at the platform and stored in a database for analysis.

At block 2012, the process 2000 may include running the selected model against the documents. For example, the model may determine a first similarity value of sample features to reference features indicating documents that are in class. For example, when keywords are utilized to represent documents, keywords associated with the documents labeled as in class from user input may be compared to keywords of a sample document. When those keywords correspond to each other well, or otherwise the reference documents and the sample document share keywords, particularly keywords that are heavily weighted as being in class, then the first similarity value may be high. When the keywords of the reference documents and the keywords of the sample document do not correlate well, then the first similarity value may be low. In examples where vectors are utilized, vectors associated with the documents labeled as in class from user input may be compared to a vector of a sample document. When that vector is closer in distance to vectors associated with in class documents than to out of class documents, the first similarity value may be high. The model may also determine a second similarity value of the sample features to reference features indicating documents that are out of class. Determining the second similarity value may be performed in the same or a similar manner as determining the first similarity value. However, instead of determining how closely a sample document correlates to documents determined to be in class, the second similarity value indicates how closely a sample document is correlated to documents marked out of class by user input.

The model may then determine whether the first similarity value is greater than the second similarity value. For example, the similarity values may be compared to determine which value is greater, assuming a scale where greater equates to higher confidence. In examples where the first similarity value is not greater than the second similarity value, then the model may determine that the document is out of class. The document may be marked as out of class and counted as “out” for purposes of display to a user. A confidence value, which may indicate how closely the document correlated to out of class documents labeled as such by a user, may also be determined. In examples where the first similarity value is greater than the second similarity value, then the model may determine that the document is in class. The document may be marked as in class and counted as “in” for purposes of display to a user. A confidence value, which may indicate how closely the document correlated to in class documents labeled as such by a user, may also be determined.
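The comparison of the first and second similarity values can be illustrated with the vector-based variant. In the Python sketch below, cosine similarity and the choice of the best-matching reference vector are assumptions; the description requires only that a greater value equate to higher confidence. The function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(sample: Vector,
             in_vectors: Sequence[Vector],
             out_vectors: Sequence[Vector]) -> Tuple[str, float]:
    """Return ("in" or "out", confidence) for one sample document vector."""
    # First similarity value: closeness to documents labeled in class.
    first = max(cosine(sample, v) for v in in_vectors)
    # Second similarity value: closeness to documents labeled out of class.
    second = max(cosine(sample, v) for v in out_vectors)
    if first > second:
        return "in", first
    # "Not greater" (including a tie) is treated as out of class.
    return "out", second
```

The returned confidence is simply the winning similarity value. A platform could instead report a margin between the two values or a calibrated probability.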

At block 2014, the process 2000 may include presenting the results. For example, the results may be presented utilizing the user interface 100 described with respect to FIGS. 1 and 4.

Returning to block 2004, in examples where the classification does not satisfy the threshold similarity, then at block 2016 the process 2000 may include requesting model training. For example, a user interface may be caused to display an indication that none of the models in the model taxonomy correlate, or correlate well enough, to the search query to be utilized for predictive purposes. The user interface may include a link or other selectable portion that, when selected, may cause the user interface to engage the model builder component to start the process of training a model.

At block 2018, the process 2000 may include receiving user input data for model training. For example, documents may be uploaded to the platform, and the user may provide user input for at least a portion of those documents indicating whether a given one of the documents is in class or out of class.

At block 2020, the process 2000 may include training a model based at least in part on the user input data. Model training may be performed in the same or a similar manner as described elsewhere herein. For example, when keywords are utilized, the model may be trained to determine which keywords correspond to in class documents and which keywords correspond to out of class documents. When vectors are utilized, the model may be trained to determine which vectors correspond to in class documents and which vectors correspond to out of class documents.

At block 2022, the process 2000 may include adding the model, as trained, to a model taxonomy. For example, based at least in part on the subject matter of the model, the model may be placed in the taxonomy. In examples, the CPC codes or other classification system may be utilized as described herein to determine whether a model should be placed in the taxonomy.

FIG. 21 illustrates a flow diagram of an example process 2100 for utilizing user input data to build classification models. The order in which the operations or steps are described is not intended to be construed as a limitation, and any number of the described operations may be combined in any order and/or in parallel to implement process 2100. The operations described with respect to the process 2100 are described as being performed by a client device, and/or a system associated with the document analysis platform. However, it should be understood that some or all of these operations may be performed by some or all of the components, devices, and/or systems described herein.

At block 2102, the process 2100 may include receiving documents from one or more databases, the documents including at least one of patents or patent applications. For example, the documents may be received based at least in part on user input and/or as determined by a document analysis platform.

At block 2104, the process 2100 may include generating first data representing the documents, the first data distinguishing components of the documents, the components including at least a title portion, an abstract portion, a detailed description portion, and a claims portion. For example, when the documents are patents and patent applications, the portions of the documents, such as the abstract, title, background, detailed description, claims, figures, etc. may be identified and distinguished.

At block 2106, the process 2100 may include generating a user interface configured to display: the components of individual ones of the documents; and an element configured to accept user input indicating whether the individual ones of the documents are in class or out of class. For example, the user interface may be the same or similar to the user interface 350 as described with respect to FIG. 3B.

At block 2108, the process 2100 may include generating a classification model based at least in part on user input data corresponding to the user input, the classification model trained utilizing at least a first portion of the documents indicated to be in class by the user input data. For example, the system may utilize that user input data to train a classification model such that the classification model is configured to determine whether a given document is more similar to those documents marked in class or more similar to those documents marked out of class. To train the classification models utilizing this user input data, the document analysis platform may perform one or more operations. In some examples, the platform may generate a positive training dataset indicating in class keywords associated with the documents marked in class by a user. For example, the platform may determine one or more keywords associated with a given document that represent the subject matter of that document. This may be performed utilizing one or more document processing techniques, such as term frequency inverse document frequency techniques, for example. The platform may also generate a negative training dataset indicating keywords from the documents marked out of class by the user input. Each of these training datasets may then be utilized to train the classification model such that the classification model is configured to determine whether a given document has keywords that are more similar to the in class keywords than to the out of class keywords. In other examples, instead of or in addition to generating training datasets based on keywords, the platform may determine a vector for a given document. The vector may be associated with a coordinate system and may represent the subject matter of the document in the form of a vector. Vectors may be generated for the documents labeled in class and for the documents labeled out of class. The classification model may be trained to determine whether a vector representation of a given document is closer to the in class vectors than to the out of class vectors in the coordinate system. Techniques to generate vectors representing documents may include vectorization techniques such as Doc2Vec, or other similar techniques.
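The keyword-based training described at block 2108 can be sketched with a small term frequency inverse document frequency computation in plain Python. This is a minimal illustration under stated assumptions: whitespace tokenization, the unsmoothed weighting formula, summing weights into positive and negative keyword profiles, and all function names are choices made for this example rather than details from the description.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def tfidf_weights(docs: Sequence[str]) -> List[dict]:
    """Per-document keyword weights: term frequency times log(N / document frequency)."""
    tokenized = [d.lower().split() for d in docs]
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    n = len(tokenized)
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({t: (c / len(tokens)) * math.log(n / doc_freq[t])
                        for t, c in counts.items()})
    return weights


def train_keyword_model(docs: Sequence[str], labels: Sequence[bool]) -> Tuple[Counter, Counter]:
    """Build the positive (in class) and negative (out of class) keyword profiles."""
    positive, negative = Counter(), Counter()
    for doc_weights, in_class in zip(tfidf_weights(docs), labels):
        (positive if in_class else negative).update(doc_weights)
    return positive, negative


def predict(model: Tuple[Counter, Counter], text: str) -> str:
    """Label a document by which keyword profile its words match more strongly."""
    positive, negative = model
    tokens = text.lower().split()
    in_score = sum(positive[t] for t in tokens)
    out_score = sum(negative[t] for t in tokens)
    return "in" if in_score > out_score else "out"
```

For the vector-based alternative, a library implementation of Doc2Vec could supply the document vectors, with classification then proceeding by distance to the in class and out of class vectors.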

In addition to the techniques for training the classification models described above, the classification models may also be trained and/or organized based at least in part on classifications of the documents. For example, when the documents are patents and patent applications, a predetermined classification system may be established for classifying the subject matter of a given document. The classification system may be determined by the platform, by one or more users, and/or by a third party. For example, patents and patent applications may be associated with a predefined classification system such as the Cooperative Patent Classification (CPC) system. The CPC system employs CPC codes that correspond to differing subject matter, as described in more detail herein. The CPC codes for a given document may be identified and the categories associated with those codes may be determined. A user interface may be presented to the user that presents the determined categories and allows a user to select which categories the user finds in class for a given purpose. The selected categories may be utilized as a feature for training the classification models. Additionally, or alternatively, the platform may determine the CPC codes for documents marked as in class and may train the classification models to compare those CPC codes with the CPC codes associated with the documents to be analyzed to determine classification.
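The CPC code comparison can be illustrated with a short Python sketch. The overlap measure and the function name are assumptions, and the code strings used in the usage note are illustrative placeholders rather than codes taken from any particular document.

```python
from typing import Iterable


def cpc_overlap(sample_codes: Iterable[str], in_class_codes: Iterable[str]) -> float:
    """Fraction of a sample document's CPC codes also found on in class documents."""
    sample = set(sample_codes)
    if not sample:
        return 0.0
    return len(sample & set(in_class_codes)) / len(sample)
```

For instance, a sample document carrying the illustrative codes "G06F3/02" and "G09G3/20" compared against in class codes "G06F3/02" and "H01H13/70" yields an overlap of 0.5. Such a value could serve as one feature among others when determining classification.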

At block 2110, the process 2100 may include causing the user interface to display an indication of: the first portion of the documents marked as in class in response to the user input; a second portion of the documents marked as out of class in response to the user input; a third portion of the documents determined to be in class utilizing the classification model; and a fourth portion of the documents determined to be out of class utilizing the classification model. For example, the user interface may be the same or similar to the user interface 100 described with respect to FIGS. 1 and 4.

Additionally, or alternatively, the process 2100 may include determining a first confidence value associated with results of the classification model. The process 2100 may also include receiving second user input indicating classification of at least one document determined to be in class utilizing the classification model. The process 2100 may also include causing the classification model to be retrained based at least in part on second user input data corresponding to the second user input. The process 2100 may also include determining a second confidence value associated with results of the classification model as retrained. The process 2100 may also include generating a user interface indicating a trendline representing a change from the first confidence value to the second confidence value, the trendline indicating an increase or decrease in confidence associated with the use of the second user input data to retrain the classification model.
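The data behind such a trendline can be as simple as the signed change between the two confidence values. The following Python sketch is a hypothetical helper, not part of the described platform; the function name and the direction labels are assumptions.

```python
from typing import Tuple


def confidence_trend(first: float, second: float) -> Tuple[float, str]:
    """Return the signed change in confidence and its direction after retraining."""
    delta = second - first
    if delta > 0:
        direction = "increase"
    elif delta < 0:
        direction = "decrease"
    else:
        direction = "no change"
    return delta, direction
```

A user interface could plot the first and second confidence values as two points and use the returned direction to style the connecting line.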

Additionally, or alternatively, the process 2100 may include receiving second user input indicating classification of at least one document determined to be in class utilizing the classification model. The process 2100 may also include causing the classification model to be retrained based at least in part on second user input data corresponding to the second user input. The process 2100 may also include determining a change in a number of the third portion of the documents marked in class utilizing the classification model as retrained. The process 2100 may also include generating a user interface indicating an influence value of the second user input on output by the classification model, the influence value indicating that additional user input is one of likely to or unlikely to have a statistical impact on performance of the classification model.

Additionally, or alternatively, the process 2100 may include generatingsecond data indicating a relationship between a first document of thedocuments and a second document of the documents, the relationshipindicating that the first document includes at least one component thatis similar to a component of the second document. The process 2100 mayalso include determining that the user input data indicates that thefirst document is in class. The process 2100 may also includedetermining that the second document is in class based at least in parton the second data indicating the relationship. In these examples, thefirst portion of the documents utilized to train the classificationmodel may include the second document.

FIG. 22 illustrates a flow diagram of another example process 2200 for utilizing user input data to build classification models. The order in which the operations or steps are described is not intended to be construed as a limitation, and any number of the described operations may be combined in any order and/or in parallel to implement process 2200. The operations described with respect to the process 2200 are described as being performed by a client device, and/or a system associated with the document analysis platform. However, it should be understood that some or all of these operations may be performed by some or all of the components, devices, and/or systems described herein.

At block 2202, the process 2200 may include generating first data representing documents received from one or more databases, the first data distinguishing components of the documents. For example, when the documents are patents and patent applications, the portions of the documents, such as the abstract, title, background, detailed description, claims, figures, etc. may be identified and distinguished.

At block 2204, the process 2200 may include generating a user interfaceconfigured to display: the components of individual ones of thedocuments; and an element configured to accept user input indicatingwhether the individual ones of the documents are in class or out ofclass. For example, the user interface may be the same or similar to theuser interface 350 as described with respect to FIG. 3B.

At block 2206, the process 2200 may include generating a model based at least in part on user input data corresponding to the user input. For example, the system may utilize that user input data to train a classification model such that the classification model is configured to determine whether a given document is more similar to those documents marked in class or more similar to those documents marked out of class. To train the classification models utilizing this user input data, the document analysis platform may perform one or more operations. In some examples, the platform may generate a positive training dataset indicating in class keywords associated with the documents marked in class by a user. For example, the platform may determine one or more keywords associated with a given document that represent the subject matter of that document. This may be performed utilizing one or more document processing techniques, such as term frequency inverse document frequency techniques, for example. The platform may also generate a negative training dataset indicating keywords from the documents marked out of class by the user input. Each of these training datasets may then be utilized to train the classification model such that the classification model is configured to determine whether a given document has keywords that are more similar to the in class keywords than to the out of class keywords. In other examples, instead of or in addition to generating training datasets based on keywords, the platform may determine a vector for a given document. The vector may be associated with a coordinate system and may represent the subject matter of the document in the form of a vector. Vectors may be generated for the documents labeled in class and for the documents labeled out of class. The classification model may be trained to determine whether a vector representation of a given document is closer to the in class vectors than to the out of class vectors in the coordinate system. Techniques to generate vectors representing documents may include vectorization techniques such as Doc2Vec, or other similar techniques.
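The keyword-based training described above can be illustrated with a minimal sketch. It assumes a smoothed term frequency inverse document frequency score and a simple count of keyword hits as the comparison; the function names (tokenize, tfidf_keywords, train_keyword_model, classify) and the tie rule are hypothetical and are not taken from the platform described herein.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(docs, top_k=5):
    """Return, for each document, the set of its top_k TF-IDF terms."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    keywords = []
    for toks in tokenized:
        term_freq = Counter(toks)
        scores = {
            term: (count / len(toks)) * math.log((1 + n_docs) / (1 + doc_freq[term]))
            for term, count in term_freq.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        keywords.append(set(ranked[:top_k]))
    return keywords


def train_keyword_model(in_class_docs, out_of_class_docs, top_k=5):
    """Build positive and negative keyword sets from the labeled documents."""
    keywords = tfidf_keywords(in_class_docs + out_of_class_docs, top_k)
    split = len(in_class_docs)
    in_kw = set().union(*keywords[:split])
    out_kw = set().union(*keywords[split:])
    return in_kw, out_kw


def classify(text, in_kw, out_kw):
    """Label a document by which keyword set it overlaps more (assumed rule)."""
    tokens = set(tokenize(text))
    in_hits = len(tokens & in_kw)
    out_hits = len(tokens & out_kw)
    # Ties fall to "out of class" in this sketch; that choice is an assumption.
    return "in class" if in_hits > out_hits else "out of class"
```

In this sketch the positive and negative training datasets are reduced to two keyword sets. A production model would likely weight keywords rather than count raw hits.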

In addition to the techniques for training the classification models described above, the classification models may also be trained and/or organized based at least in part on classifications of the documents. For example, when the documents are patents and patent applications, a predetermined classification system may be established for classifying the subject matter of a given document. The classification system may be determined by the platform, by one or more users, and/or by a third party. For example, patents and patent applications may be associated with a predefined classification system such as the Cooperative Patent Classification (CPC) system. The CPC system employs CPC codes that correspond to differing subject matter, as described in more detail herein. The CPC codes for a given document may be identified and the categories associated with those codes may be determined. A user interface may be presented to the user that presents the determined categories and allows a user to select which categories the user finds in class for a given purpose. The selected categories may be utilized as a feature for training the classification models. Additionally, or alternatively, the platform may determine the CPC codes for documents marked as in class and may train the classification models to compare those CPC codes with the CPC codes associated with the documents to be analyzed to determine classification.
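One way the CPC comparison might be turned into a numeric feature is sketched below. The sketch assumes that codes are compared at the subclass level (the first four characters, such as H01M) and that the feature is the fraction of a document's subclasses that the user selected as in class. The helper names cpc_subclass and cpc_feature and the example codes are illustrative only.

```python
def cpc_subclass(code):
    """Return the four-character subclass prefix of a CPC code, e.g. 'H01M'."""
    return code.strip()[:4].upper()


def cpc_feature(doc_codes, selected_codes):
    """Fraction of the document's CPC subclasses that were selected as in class."""
    selected = {cpc_subclass(c) for c in selected_codes}
    doc = {cpc_subclass(c) for c in doc_codes}
    if not doc:
        # A document without CPC codes contributes no evidence either way.
        return 0.0
    return len(doc & selected) / len(doc)
```

The resulting value could be appended to the keyword or vector features of a document before training. Comparing full codes instead of subclasses would make the feature stricter.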

At block 2208, the process 2200 may include determining, utilizing themodel, a first portion of the documents that are in class and a secondportion of the documents that are out of class. For example, whenkeywords are utilized, the model may determine that keywords associatedwith the first portion of the documents correlate to in class keywordsas trained, while keywords associated with the second portion of thedocuments correlate to the out of class keywords as trained. Whenvectors are utilized, the model may determine that vectors associatedwith the first portion of the documents correlate to in class vectors astrained, while vectors associated with the second portion of thedocuments correlate to the out of class vectors as trained.

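At block 2210, the process 2200 may include causing the user interface to display an indication of the first portion of the documents and the second portion of the documents. For example, the user interface may be the same as or similar to the user interface 100 described with respect to FIGS. 1 and 4.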

Additionally, or alternatively, the process 2200 may include determining a first confidence value associated with results of the model and receiving second user input indicating classification of at least one document determined to be in class utilizing the model. The process 2200 may also include causing the model to be retrained based at least in part on second user input data corresponding to the second user input. The process 2200 may also include determining a second confidence value associated with results of the model as retrained and generating a user interface indicating a trendline representing a change from the first confidence value to the second confidence value.
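As a hedged illustration of the trendline, the sketch below assumes that the confidence value of a model is the mean of its per-document confidences, which the description does not specify. The names confidence_value and trendline and the returned dictionary layout are hypothetical.

```python
from statistics import mean


def confidence_value(per_document_confidences):
    """Aggregate per-document confidences into one value (assumed: the mean)."""
    return mean(per_document_confidences)


def trendline(first_value, second_value):
    """Describe the change in confidence before and after retraining."""
    delta = second_value - first_value
    if delta > 0:
        direction = "increase"
    elif delta < 0:
        direction = "decrease"
    else:
        direction = "no change"
    # The two points are what a user interface would plot as the trendline.
    return {"points": [first_value, second_value], "delta": delta, "direction": direction}
```

A user interface could plot the two points and color the segment by the direction field.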

Additionally, or alternatively, the process 2200 may include receiving second user input indicating classification of at least one document determined to be in class utilizing the model. The process 2200 may also include causing the model to be retrained based at least in part on second user input data corresponding to the second user input. The process 2200 may also include determining a change in a number of the first portion of the documents determined to be in class utilizing the model as retrained. The process 2200 may also include generating a user interface indicating an influence value of the second user input on output by the model.
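The influence value can be sketched as follows, under the assumption that it is the change in the in class count divided by the total number of documents. The threshold used to decide whether further labeling is likely to matter is a hypothetical parameter, not a value from the description.

```python
def influence_value(in_class_before, in_class_after, total_documents):
    """Relative change in the in class count caused by retraining."""
    if total_documents <= 0:
        raise ValueError("total_documents must be positive")
    return abs(in_class_after - in_class_before) / total_documents


def more_input_likely_to_matter(influence, threshold=0.01):
    """Assumed rule: a small change suggests further labeling has little impact."""
    return influence >= threshold
```

For example, if retraining moves the in class count from 120 to 123 out of 1,000 documents, the influence value is 0.003, which falls below the assumed threshold.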

Additionally, or alternatively, the process 2200 may include generatingsecond data indicating a relationship between a first document of thedocuments and a second document of the documents. The process 2200 mayalso include determining that the user input data indicates that thefirst document is in class and determining that the second document isin class based at least in part on the second data indicating therelationship. In these examples, generating the model may includetraining the model utilizing the second document.
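The relationship-based labeling above might look like the following sketch. It assumes that relationships are stored as unordered pairs of document identifiers and that a label is propagated one hop only. The function name propagate_in_class is illustrative.

```python
def propagate_in_class(in_class_ids, related_pairs):
    """Expand the in class set with documents related to a labeled document."""
    labeled = set(in_class_ids)
    expanded = set(labeled)
    for first_doc, second_doc in related_pairs:
        # Only user-labeled documents propagate, so the expansion is one hop.
        if first_doc in labeled:
            expanded.add(second_doc)
        if second_doc in labeled:
            expanded.add(first_doc)
    return expanded
```

The expanded set would then serve as the in class training data for the model.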

Additionally, or alternatively, the process 2200 may includedetermining, for individual ones of the documents marked as in class, aconfidence value indicating a degree of classification. The process 2200may also include determining a ranking of the individual ones of thedocuments marked as in class based at least in part on the confidencevalue. The process 2200 may also include causing the user interface todisplay the individual ones of the documents marked as in class based atleast in part on the ranking.
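A minimal sketch of the ranking step, assuming that confidence values are held in a dictionary keyed by a document identifier (a hypothetical layout):

```python
def rank_in_class(confidences):
    """Order document identifiers from highest to lowest confidence."""
    # Ties are broken by identifier so that the display order is stable.
    return sorted(confidences, key=lambda doc_id: (-confidences[doc_id], doc_id))
```

The user interface could then list the in class documents in the returned order.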

Additionally, or alternatively, the process 2200 may include causingdisplay, via the user interface, of indications associated withclassification of the documents. For example, the indications mayinclude a first indication of a first number of the documents marked inclass in response to user input. The indications may also include asecond indication of a second number of the documents marked out ofclass in response to the user input. The indications may also include athird indication of a third number of the documents determined to be inclass utilizing the model. The indications may also include a fourthindication of a fourth number of the documents determined to be out ofclass utilizing the model.
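The four indications can be computed with a simple tally. The sketch assumes each document is a record with optional user_label and model_label fields holding "in" or "out", and that a user label takes precedence over a model prediction; both the field names and the precedence rule are assumptions.

```python
from collections import Counter


def classification_counts(records):
    """Count documents by who classified them and how."""
    counts = Counter()
    for record in records:
        if record.get("user_label") is not None:
            # A user label takes precedence over any model prediction.
            counts[("user", record["user_label"])] += 1
        elif record.get("model_label") is not None:
            counts[("model", record["model_label"])] += 1
    return {
        "user_in_class": counts[("user", "in")],
        "user_out_of_class": counts[("user", "out")],
        "model_in_class": counts[("model", "in")],
        "model_out_of_class": counts[("model", "out")],
    }
```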

Additionally, or alternatively, the process 2200 may include causing display, via the user interface, of a number of sections. The sections may include a first section indicating first keywords determined to be statistically relevant by the model for identifying the first portion of the documents, wherein the first keywords are displayed in a manner that indicates a first ranking of statistical classification of the first keywords, wherein the first keywords are selectable via user input to be removed from the first section. The sections may also include a second section indicating second keywords determined to be statistically relevant by the model for identifying the second portion of the documents, wherein the second keywords are displayed in a manner that indicates a second ranking of the statistical classification of the second keywords, wherein the second keywords are selectable via user input to be removed from the second section. The process 2200 may also include, based at least in part on receiving the user input indicating that at least one of the first keywords or the second keywords should be removed, retraining the model to account for removal of the at least one of the first keywords or the second keywords.

Additionally, or alternatively, the process 2200 may include searching,utilizing the model, one or more databases for additional documentsdetermined to be in class by the model. The process 2200 may alsoinclude receiving an instance of the additional documents from the oneor more databases. The process 2200 may also include receiving userinput indicating classification of the additional documents. The process2200 may also include retraining the model based at least in part on theuser input indicating the classification of the additional documents.

FIG. 23 illustrates a flow diagram of an example process 2300 for building classification models utilizing negative datasets. The order in which the operations or steps are described is not intended to be construed as a limitation, and any number of the described operations may be combined in any order and/or in parallel to implement process 2300. The operations described with respect to the process 2300 are described as being performed by a client device, and/or a system associated with the document analysis platform. However, it should be understood that some or all of these operations may be performed by some or all of the components, devices, and/or systems described herein.

At block 2302, the process 2300 may include storing, in association witha platform configured to receive documents from one or more databases,first data representing the documents, the documents including at leastone of patents or patent applications. For example, the documents may bereceived based at least in part on user input and/or as determined by adocument analysis platform.

At block 2304, the process 2300 may include receiving user input dataindicating a first portion of the documents that are out of class, theuser input data received via a user interface configured to displaycomponents of individual ones of the documents and receive user inputindicating whether the individual ones of the documents are in class.For example, the user may review all or a portion of the document andprovide user input indicating whether the document is in class or out ofclass. This may be performed utilizing a user interface such as the userinterface 350 described with respect to FIG. 3B.

At block 2306, the process 2300 may include determining keywords of thefirst portion of the documents that represent the first portion of thedocuments. For example, the platform may generate a training datasetindicating the keywords associated with the documents marked out ofclass by a user. For example, the platform may determine one or morekeywords associated with a given document that represent the subjectmatter of that document. This may be performed utilizing one or moredocument processing techniques, such as term frequency inverse documentfrequency techniques, for example.

At block 2308, the process 2300 may include generating a classificationmodel configured to determine classification of the documents, theclassification model trained utilizing the keywords as indicators ofsubject matter that is out of class. For example, the model may betrained to accept text data representing a sample document and todetermine keywords representative of that sample document. Then, themodel may be trained to compare those keywords to reference keywordsthat indicate out of class subject matter.

At block 2310, the process 2300 may include determining, utilizing theclassification model: a second portion of the documents that are out ofclass; and a third portion of the documents that are in class. Forexample, the models may predict which of the documents that have notbeen labeled in response to user input correlate more to in classdocuments than to out of class documents, and vice versa.

Additionally, or alternatively, the process 2300 may include generatinga tokenized version of the first portion of the documents, the tokenizedversion including lexical tokens representing elements of the firstportion of the documents. The process 2300 may also include applying abigram-based language model to the tokenized version of the firstportion of the documents, the bigram-based language model configured todetermine a bigram frequency of the lexical tokens. The process 2300 mayalso include selecting keywords corresponding to a portion of thelexical tokens having a high bigram frequency with respect to lexicaltokens other than the portion of the lexical tokens.
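The tokenization and bigram steps can be sketched as below. Raw counting of adjacent token pairs stands in for the bigram-based language model, and "high bigram frequency" is interpreted as a count at or above a hypothetical min_count parameter; both are simplifying assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase lexical tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bigram_counts(docs):
    """Count adjacent token pairs across all documents."""
    counts = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        # zip pairs each token with its successor within the same document.
        counts.update(zip(tokens, tokens[1:]))
    return counts


def bigram_keywords(docs, min_count=2):
    """Return bigrams seen at least min_count times, most frequent first."""
    frequent = [(bg, c) for bg, c in bigram_counts(docs).items() if c >= min_count]
    frequent.sort(key=lambda item: (-item[1], item[0]))
    return [" ".join(bigram) for bigram, _ in frequent]
```

Applied to the documents marked out of class, the returned phrases would serve as candidate out of class keywords.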

Additionally, or alternatively, the process 2300 may include causingdisplay of the keywords via the user interface, wherein the keywords aredisplayed in a manner that indicates a ranking of the statisticalclassification of the keywords. The process 2300 may also includereceiving, via the user interface, user input indicating at least oneof: an indication that a first keyword of the keywords should be rankedas more statistically relevant by the classification model; or anindication that a second keyword of the keywords should be ranked asless statistically relevant by the classification model. The process2300 may also include retraining the classification model based at leastin part on the user input.

Additionally, or alternatively, the process 2300 may include determiningone or more categories of the first portion of the documents, the one ormore categories based at least in part on predefined classification ofthe individual ones of the documents by a system associated with the oneor more databases. The process 2300 may also include causing display ofthe one or more categories via the user interface and receiving, via theuser interface, user input indicating a category of the one or morecategories that is out of class. The process 2300 may also includeidentifying a set of the documents that is associated with the categoryand including the set of the documents in the second portion of thedocuments determined to be out of class.
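A short sketch of the category-based step follows. It assumes a mapping from document identifiers to the predefined categories assigned by the source database; the mapping layout and the function names are hypothetical.

```python
def documents_in_category(doc_categories, category):
    """Return the identifiers of documents associated with the category."""
    return {doc_id for doc_id, cats in doc_categories.items() if category in cats}


def extend_out_of_class(out_of_class_ids, doc_categories, category):
    """Add every document in a user-rejected category to the out of class set."""
    return set(out_of_class_ids) | documents_in_category(doc_categories, category)
```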

FIG. 24 illustrates a flow diagram of another example process 2400 for building classification models utilizing negative datasets. The order in which the operations or steps are described is not intended to be construed as a limitation, and any number of the described operations may be combined in any order and/or in parallel to implement process 2400. The operations described with respect to the process 2400 are described as being performed by a client device, and/or a system associated with the document analysis platform. However, it should be understood that some or all of these operations may be performed by some or all of the components, devices, and/or systems described herein.

At block 2402, the process 2400 may include storing first datarepresenting documents including at least one of patents or patentapplications. For example, the documents may be received based at leastin part on user input and/or as determined by a document analysisplatform.

At block 2404, the process 2400 may include receiving user input dataindicating a first portion of the documents that are out of class. Forexample, the user may review all or a portion of the document andprovide user input indicating whether the document is in class or out ofclass. This may be performed utilizing a user interface such as the userinterface 350 described with respect to FIG. 3B.

At block 2406, the process 2400 may include determining features of thefirst portion of the documents that represent the first portion of thedocuments. For example, the features may include keywords that representthe subject matter of the documents and/or vectors that represent thesubject matter of the documents.

At block 2408, the process 2400 may include generating a modelconfigured to determine classification of the documents, the modeltrained based at least in part on the features. As described more fullyherein, the models may be trained utilizing the features to determinewhether keywords of a sample document, and/or a vector representing thesample document, more closely correlate to keywords and/or vectors ofdocuments labeled as in class or to keywords and/or vectors of documentslabeled as out of class.

At block 2410, the process 2400 may include determining, utilizing themodel, a second portion of the documents that are out of class. Forexample, the model may be utilized to determine which of the documentshave keywords and/or vectors that more closely correspond to documentslabeled out of class than to documents labeled in class.

Additionally, or alternatively, the process 2400 may include generatinga tokenized version of the first portion of the documents, the tokenizedversion including lexical tokens representing elements of the firstportion of the documents. The process 2400 may also include applying abigram-based language model to the tokenized version of the firstportion of the documents, the bigram-based language model configured todetermine a bigram frequency of the lexical tokens. The process 2400 mayalso include selecting a portion of the lexical tokens having a highbigram frequency with respect to lexical tokens other than the portionof the lexical tokens.

Additionally, or alternatively, the process 2400 may include causingdisplay of keywords corresponding to the features in a manner thatindicates a ranking of the classification of the keywords. The process2400 may also include receiving user input indicating at least one of:an indication that a first keyword of the keywords should be ranked asmore in class; or an indication that a second keyword of the keywordsshould be ranked as less in class. The process 2400 may also includeretraining the model based at least in part on the user input.

Additionally, or alternatively, the process 2400 may include determininga category of a document in the first portion of the documents, thecategory based at least in part on classification of the document by asystem from which the document was acquired. The process 2400 may alsoinclude receiving user input indicating the category is out of class.The process 2400 may also include identifying a set of the documentsthat is associated with the category and including the set of thedocuments in the second portion of the documents determined to be out ofclass.

Additionally, or alternatively, the process 2400 may include determining a set of documents identified as in class via user input data. The process 2400 may also include generating a first dataset of the first portion of the documents identified as out of class and generating a second dataset of the set of documents identified as in class. In these examples, generating the model may comprise training the model based at least in part on the first dataset and the second dataset.

Additionally, or alternatively, the process 2400 may include determininga first correlation between the individual ones of the documents and thekeywords and determining a second correlation between the individualones of the documents and a set of documents indicated to be in classvia user input. The process 2400 may also include determining that thefirst correlation is greater than the second correlation.

Additionally, or alternatively, the process 2400 may include causingdisplay of the features via a user interface and receiving user input,via the user interface, indicating that a feature of the features shouldbe marked as in class instead of out of class. The process 2400 may alsoinclude retraining the model based at least in part on the user input.

Additionally, or alternatively, the process 2400 may include determininga category of a document in the first portion of the documents, thecategory based at least in part on classification of the document by asystem from which the document was acquired. The process 2400 may alsoinclude querying one or more databases for additional documentsassociated with the category. The process 2400 may also includeretraining the model based at least in part on the additional documents.

FIG. 25 illustrates a flow diagram of an example process 2500 for utilizing classification models for determining classification of example documents. The order in which the operations or steps are described is not intended to be construed as a limitation, and any number of the described operations may be combined in any order and/or in parallel to implement process 2500. The operations described with respect to the process 2500 are described as being performed by a client device, and/or a system associated with the document analysis platform. However, it should be understood that some or all of these operations may be performed by some or all of the components, devices, and/or systems described herein.

At block 2502, the process 2500 may include receiving, via a user interface associated with a platform for classifying documents as in class or out of class, first user input data indicating a first portion of the documents are in class. For example, the system may receive user input data indicating in class documents and out of class documents from a subset of first documents. For example, if the first documents include 1,000 documents, the user input data may indicate classification for a subset, such as 20, of those documents. Users may utilize a user interface to provide user input, such as the user interface 350 from FIG. 3B.

At block 2504, the process 2500 may include receiving, via the userinterface, second user input data indicating a second portion of thedocuments are out of class. This may be performed in the same or asimilar manner as described with respect to block 2502, above.

At block 2506, the process 2500 may include determining a first set of features of the first portion of the documents that are representative of the first portion of the documents. For example, the platform may generate a positive training dataset indicating in class keywords associated with the documents marked in class by a user. For example, the platform may determine one or more keywords associated with a given document that represent the subject matter of that document. This may be performed utilizing one or more document processing techniques, such as term frequency inverse document frequency techniques, for example. When vectors are utilized, the first set of features may include vectors that represent documents labeled as in class.

At block 2508, the process 2500 may include determining a second set offeatures of the second portion of the documents that are representativeof the second portion of the documents. This process may be performed inthe same or a similar manner as the processes described with respect toblock 2506, but with the documents labeled as out of class.

At block 2510, the process 2500 may include training a classification model based at least in part on the first features and the second features, the classification model trained to determine a classification of individual ones of the documents by analyzing features of the individual ones of the documents in association with the first features and the second features. For example, the system may utilize that user input data to train a classification model such that the classification model is configured to determine whether a given document is more similar to those documents marked in class or more similar to those documents marked out of class. To train the classification models utilizing this user input data, the document analysis platform may perform one or more operations. In some examples, the platform may generate a positive training dataset indicating in class keywords associated with the documents marked in class by a user. For example, the platform may determine one or more keywords associated with a given document that represent the subject matter of that document. This may be performed utilizing one or more document processing techniques, such as term frequency inverse document frequency techniques, for example. The platform may also generate a negative training dataset indicating keywords from the documents marked out of class by the user input. Each of these training datasets may then be utilized to train the classification model such that the classification model is configured to determine whether a given document has keywords that are more similar to the in class keywords than to the out of class keywords. In other examples, instead of or in addition to generating training datasets based on keywords, the platform may determine a vector for a given document. The vector may be associated with a coordinate system and may represent the subject matter of the document in the form of a vector. Vectors may be generated for the documents labeled in class and for the documents labeled out of class. The classification model may be trained to determine whether a vector representation of a given document is closer to the in class vectors than to the out of class vectors in the coordinate system. Techniques to generate vectors representing documents may include vectorization techniques such as Doc2Vec, or other similar techniques.
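The vector-based comparison can be illustrated with a small sketch. A bag-of-words count vector is swapped in for a learned embedding such as Doc2Vec, and "closer" is interpreted as higher cosine similarity to the centroid of each labeled group; both choices, and the names embed, centroid, cosine, and classify, are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Bag-of-words count vector, a simple stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def centroid(vectors):
    """Average a list of sparse vectors into one sparse vector."""
    total = Counter()
    for vector in vectors:
        total.update(vector)
    return {term: count / len(vectors) for term, count in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dictionaries."""
    dot = sum(value * b.get(term, 0.0) for term, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, in_centroid, out_centroid):
    """Label a document by the centroid its vector is more similar to."""
    vector = embed(text)
    closer_to_in = cosine(vector, in_centroid) > cosine(vector, out_centroid)
    return "in class" if closer_to_in else "out of class"
```

With a real embedding model, only embed would change; the centroid comparison would stay the same.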

In addition to the techniques for training the classification models described above, the classification models may also be trained and/or organized based at least in part on classifications of the documents. For example, when the documents are patents and patent applications, a predetermined classification system may be established for classifying the subject matter of a given document. The classification system may be determined by the platform, by one or more users, and/or by a third party. For example, patents and patent applications may be associated with a predefined classification system such as the Cooperative Patent Classification (CPC) system. The CPC system employs CPC codes that correspond to differing subject matter, as described in more detail herein. The CPC codes for a given document may be identified and the categories associated with those codes may be determined. A user interface may be presented to the user that presents the determined categories and allows a user to select which categories the user finds in class for a given purpose. The selected categories may be utilized as a feature for training the classification models. Additionally, or alternatively, the platform may determine the CPC codes for documents marked as in class and may train the classification models to compare those CPC codes with the CPC codes associated with the documents to be analyzed to determine classification.

At block 2512, the process 2500 may include determining, utilizing theclassification model, a third portion of the documents that are inclass. For example, the model may be utilized to determine which of thedocuments have keywords and/or vectors that more closely correspond todocuments labeled in class than to documents labeled out of class.

At block 2514, the process 2500 may include determining, utilizing theclassification model, a fourth portion of the documents that are out ofclass. For example, the model may be utilized to determine which of thedocuments have keywords and/or vectors that more closely correspond todocuments labeled out of class than to documents labeled in class.

Additionally, or alternatively, the process 2500 may include determininga first confidence score that a document of the documents correlates tothe first features. The process 2500 may also include determining asecond confidence score that the document correlates to the secondfeatures. The process 2500 may also include determining that the firstconfidence score indicates more confidence than the second confidencescore. The process 2500 may also include associating the document withthe third portion of the documents based at least in part on the firstconfidence score indicating more confidence than the second confidencescore.

Additionally, or alternatively, the process 2500 may include determininga document type of the documents, the document type indicating at leastone of a subject matter associated with the documents, a database fromwhich the documents were received, or a format of the documents. Theprocess 2500 may also include selecting a base model of multiple modelsbased at least in part on the document type, wherein the base model hasbeen configured to generate output utilizing the document type.

Additionally, or alternatively, the process 2500 may include receivingthird user input data indicating additional classificationdeterminations associated with the documents other than the firstportion of the documents. The process 2500 may also include retrainingthe classification model based at least in part on the third user inputdata. The process 2500 may also include determining a difference in thenumber of the documents in the third portion after utilizing theclassification model as retrained. The process 2500 may also includegenerating a labeling influence value indicating a degree of influenceon the classification model by additional user input data.

FIG. 26 illustrates a flow diagram of another example process 2600 for utilizing classification models for determining classification of example documents. The order in which the operations or steps are described is not intended to be construed as a limitation, and any number of the described operations may be combined in any order and/or in parallel to implement process 2600. The operations described with respect to the process 2600 are described as being performed by a client device, and/or a system associated with the document analysis platform. However, it should be understood that some or all of these operations may be performed by some or all of the components, devices, and/or systems described herein.

At block 2602, the process 2600 may include receiving first user input data indicating a first portion of documents are in class. For example, the system may receive user input data indicating in class documents and out of class documents from a subset of first documents. For example, if the first documents include 1,000 documents, the user input data may indicate classification for a subset, such as 20, of those documents. Users may utilize a user interface to provide user input, such as the user interface 350 from FIG. 3B.

At block 2604, the process 2600 may include receiving second user input data indicating a second portion of the documents are out of class. This may be performed in the same or a similar manner as described with respect to block 2602, above, except that the user input data may be with respect to out of class documents.

At block 2606, the process 2600 may include determining a first set of features of the first portion of the documents that are representative of the first portion of the documents. For example, the platform may generate a positive training dataset indicating in class keywords associated with the documents marked in class by a user. For example, the platform may determine one or more keywords associated with a given document that represent the subject matter of that document. This may be performed utilizing one or more document processing techniques, such as term frequency inverse document frequency techniques, for example. When vectors are utilized, the first set of features may include vectors that represent documents labeled as in class.

At block 2608, the process 2600 may include determining a second set of features of the second portion of the documents that are representative of the second portion of the documents. This process may be performed in the same or a similar manner as the processes described with respect to block 2606, but with the documents labeled as out of class.

At block 2610, the process 2600 may include training a model based at least in part on the first features and the second features, the model trained to determine a classification of individual ones of the documents. For example, the system may utilize that user input data to train a classification model such that the classification model is configured to determine whether a given document is more similar to those documents marked in class or more similar to those documents marked out of class. To train the classification models utilizing this user input data, the document analysis platform may perform one or more operations. In some examples, the platform may generate a positive training dataset indicating in class keywords associated with the documents marked in class by a user. For example, the platform may determine one or more keywords associated with a given document that represent the subject matter of that document. This may be performed utilizing one or more document processing techniques, such as term frequency inverse document frequency techniques, for example. The platform may also generate a negative training dataset indicating keywords from the documents marked out of class by the user input. Each of these training datasets may then be utilized to train the classification model such that the classification model is configured to determine whether a given document has keywords that are more similar to the in class keywords than to the out of class keywords. In other examples, instead of or in addition to generating training datasets based on keywords, the platform may determine a vector for a given document. The vector may be associated with a coordinate system and may represent the subject matter of the document in the form of a vector. Vectors may be generated for the documents labeled in class and for the documents labeled out of class. The classification model may be trained to determine whether a vector representation of a given document is closer to the in class vectors than to the out of class vectors in the coordinate system. Techniques to generate vectors representing documents may include vectorization techniques such as Doc2Vec, or other similar techniques.
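The keyword-based training described above can be illustrated with a minimal sketch in Python. It assumes whitespace tokenization, a smoothed inverse document frequency, and a simple keyword-overlap decision rule; the names `top_keywords` and `classify` are hypothetical, and the sketch is one possible reading of the technique rather than the platform's implementation.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase whitespace tokenization is sufficient here.
    return text.lower().split()


def top_keywords(docs: list[str], corpus: list[str], k: int = 3) -> set[str]:
    """Return the k terms with the highest summed TF-IDF score over docs.

    Inverse document frequency is computed over the whole corpus."""
    n = len(corpus)
    doc_freq: Counter = Counter()
    for text in corpus:
        doc_freq.update(set(tokenize(text)))
    scores: Counter = Counter()
    for text in docs:
        tokens = tokenize(text)
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            # Smoothed IDF so unseen terms do not divide by zero.
            idf = math.log((1 + n) / (1 + doc_freq[term]))
            scores[term] += tf * idf
    return {term for term, _ in scores.most_common(k)}


def classify(text: str, in_keywords: set[str], out_keywords: set[str]) -> str:
    """Label a document by which keyword set it overlaps more.

    Ties fall to "out of class" (an arbitrary choice for this sketch)."""
    tokens = set(tokenize(text))
    in_hits = len(tokens & in_keywords)
    out_hits = len(tokens & out_keywords)
    return "in class" if in_hits > out_hits else "out of class"
```

In use, the positive and negative keyword sets would be built from the documents a user labeled in class and out of class, and `classify` would then be applied to the remaining, unlabeled documents.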

In addition to the techniques for training the classification models described above, the classification models may also be trained and/or organized based at least in part on classifications of the documents. For example, when the documents are patents and patent applications, a predetermined classification system may be established for classifying the subject matter of a given document. The classification system may be determined by the platform, by one or more users, and/or by a third party. For example, patents and patent applications may be associated with a predefined classification system such as the Cooperative Patent Classification (CPC) system. The CPC system employs CPC codes that correspond to differing subject matter, as described in more detail herein. The CPC codes for a given document may be identified and the categories associated with those codes may be determined. A user interface may be presented to the user that presents the determined categories and allows a user to select which categories the user finds in class for a given purpose. The selected categories may be utilized as a feature for training the classification models. Additionally, or alternatively, the platform may determine the CPC codes for documents marked as in class and may train the classification models to compare those CPC codes with the CPC codes associated with the documents to be analyzed to determine classification.

At block 2612, the process 2600 may include determining, based at least in part on the model, a third portion of the documents that are in class. For example, the model may be utilized to determine which of the documents have keywords and/or vectors that more closely correspond to documents labeled in class than to documents labeled out of class.

At block 2614, the process 2600 may include determining, based at least in part on the model, a fourth portion of the documents that are out of class. For example, the model may be utilized to determine which of the documents have keywords and/or vectors that more closely correspond to documents labeled out of class than to documents labeled in class.

Additionally, or alternatively, the process 2600 may include determining that a document of the documents correlates to the first features more than the second features. The process 2600 may also include associating the document with the third portion of the documents based at least in part on the document correlating to the first features more than the second features.

Additionally, or alternatively, the process 2600 may include determining a document type of the documents. The process 2600 may also include selecting a base model of multiple models based at least in part on the document type, wherein the base model has been configured to generate output utilizing the document type.

Additionally, or alternatively, the process 2600 may include receiving third user input data indicating additional classification determinations associated with the documents other than the first portion of the documents. The process 2600 may also include retraining the model based at least in part on the third user input data. The process 2600 may also include determining a difference in the number of the documents in the third portion after utilizing the model as retrained. The process 2600 may also include generating a labeling influence value indicating a degree of influence on the model by additional user input data.

Additionally, or alternatively, the process 2600 may include determining, for individual ones of the first portion of the documents, a first vector representing the first features. The process 2600 may also include determining, for individual ones of the second portion of the documents, a second vector representing the second features. The process 2600 may also include determining that a document of the documents is associated with the third portion of the documents based at least in part on a third vector representing the document being associated with the first vector more than the second vector.
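The vector comparison may be sketched as follows, assuming document vectors have already been produced by some vectorization technique (for example, Doc2Vec). Averaging the labeled vectors into centroids and comparing by cosine similarity are assumptions chosen for illustration; the function names are hypothetical.

```python
import math


def centroid(vectors: list[list[float]]) -> list[float]:
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_in_class(doc_vector: list[float],
                in_class_vectors: list[list[float]],
                out_of_class_vectors: list[list[float]]) -> bool:
    """True when the document is closer to the in class centroid."""
    in_score = cosine(doc_vector, centroid(in_class_vectors))
    out_score = cosine(doc_vector, centroid(out_of_class_vectors))
    return in_score > out_score
```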

Additionally, or alternatively, the process 2600 may include identifying additional documents that differ from the documents. The process 2600 may also include determining a classification of individual ones of the additional documents utilizing the model. The process 2600 may also include determining a confidence value indicating performance of the model for determining the classification of the individual ones of the additional documents. The process 2600 may also include determining that the model has been trained successfully based at least in part on the confidence value satisfying a threshold confidence value.
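One way to read the threshold check is sketched below. It assumes the confidence value is the mean of per-document confidence scores on the additional documents and that the threshold is 0.8; both are illustrative assumptions, since the description does not fix either.

```python
def trained_successfully(confidences: list[float],
                         threshold: float = 0.8) -> bool:
    """Hypothetical check: mean per-document confidence meets the threshold."""
    if not confidences:
        # No additional documents were scored, so nothing supports success.
        return False
    confidence_value = sum(confidences) / len(confidences)
    return confidence_value >= threshold
```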

Additionally, or alternatively, the process 2600 may include causing display of keywords representing the first features via a user interface. The process 2600 may also include receiving user input, via the user interface, indicating that a set of documents associated with a keyword of the keywords should be marked as in class instead of out of class. The process 2600 may also include retraining the model based at least in part on the user input.

Additionally, or alternatively, the process 2600 may include receiving third user input data indicating classification of a document associated with the third portion of the documents, the third user input data indicating that the document is out of class. The process 2600 may also include determining third features representing the document. The process 2600 may also include retraining the model based at least in part on the third features.

FIG. 27 illustrates a flow diagram of an example process 2700 for building model taxonomies. The order in which the operations or steps are described is not intended to be construed as a limitation, and any number of the described operations may be combined in any order and/or in parallel to implement process 2700. The operations described with respect to the process 2700 are described as being performed by a client device, and/or a system associated with the document analysis platform. However, it should be understood that some or all of these operations may be performed by some or all of the components, devices, and/or systems described herein.

At block 2702, the process 2700 may include generating classification models configured to identify a document of multiple documents as in class or out of class, the classification models trained utilizing user input data indicating a first portion of the documents as in class and a second portion of the documents as out of class, wherein the documents include patents and patent applications. For example, the system may utilize that user input data to train a classification model such that the classification model is configured to determine whether a given document is more similar to those documents marked in class or more similar to those documents marked out of class. To train the classification models utilizing this user input data, the document analysis platform may perform one or more operations. In some examples, the platform may generate a positive training dataset indicating in class keywords associated with the documents marked in class by a user. For example, the platform may determine one or more keywords associated with a given document that represent the subject matter of that document. This may be performed utilizing one or more document processing techniques, such as term frequency inverse document frequency techniques, for example. The platform may also generate a negative training dataset indicating keywords from the documents marked out of class by the user input. Each of these training datasets may then be utilized to train the classification model such that the classification model is configured to determine whether a given document has keywords that are more similar to the in class keywords than to the out of class keywords. In other examples, instead of or in addition to generating training datasets based on keywords, the platform may determine a vector for a given document. The vector may be associated with a coordinate system and may represent the subject matter of the document in the form of a vector. Vectors may be generated for the documents labeled in class and for the documents labeled out of class. The classification model may be trained to determine whether a vector representation of a given document is closer to the in class vectors than to the out of class vectors in the coordinate system. Techniques to generate vectors representing documents may include vectorization techniques such as Doc2Vec, or other similar techniques.

In addition to the techniques for training the classification models described above, the classification models may also be trained and/or organized based at least in part on classifications of the documents. For example, when the documents are patents and patent applications, a predetermined classification system may be established for classifying the subject matter of a given document. The classification system may be determined by the platform, by one or more users, and/or by a third party. For example, patents and patent applications may be associated with a predefined classification system such as the Cooperative Patent Classification (CPC) system. The CPC system employs CPC codes that correspond to differing subject matter, as described in more detail herein. The CPC codes for a given document may be identified and the categories associated with those codes may be determined. A user interface may be presented to the user that presents the determined categories and allows a user to select which categories the user finds in class for a given purpose. The selected categories may be utilized as a feature for training the classification models. Additionally, or alternatively, the platform may determine the CPC codes for documents marked as in class and may train the classification models to compare those CPC codes with the CPC codes associated with the documents to be analyzed to determine classification.

At block 2704, the process 2700 may include determining, for individual ones of the classification models, a technology category and one or more subcategories associated with the individual ones of the classification models, the technology category and the one or more subcategories associated with a classification system associated with the multiple documents. For example, when the documents are patents and patent applications, a predetermined classification system may be established for classifying the subject matter of a given document. The classification system may be determined by the platform, by one or more users, and/or by a third party. For example, patents and patent applications may be associated with a predefined classification system such as the Cooperative Patent Classification (CPC) system. The CPC system employs CPC codes that correspond to differing subject matter. The CPC codes for a given document may be identified and the categories associated with those codes may be determined.

At block 2706, the process 2700 may include generating a taxonomy of the classification models, the taxonomy indicating categorical relationships between the classification models, wherein generating the taxonomy is based at least in part on the technology category and the one or more subcategories associated with the individual ones of the classification models. For example, the taxonomy may include nodes representing the models and linkages between the nodes representing relationships between the models.

Additionally, or alternatively, the process 2700 may include determining a code associated with individual ones of the documents, the code associated with the individual ones of the documents based at least in part on a codification system associated with the documents. The process 2700 may also include determining a first portion of the code associated with the technology category. The process 2700 may also include determining a second portion of the code associated with the one or more subcategories. The process 2700 may also include determining a tier of the taxonomy to associate the individual ones of the classification models based at least in part on the technology category and the one or more subcategories.
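When the codification system is the CPC system, splitting a code into its portions may look like the sketch below. A CPC symbol such as "H04L 9/32" consists of a section letter, a two-digit class, a subclass letter, a main group, and a subgroup. Treating the subclass ("H04L") as the technology category and the group/subgroup ("9/32") as the subcategory is an assumption made for illustration, as are the function names and the regular expression.

```python
import re

# Section (A-H or Y), two-digit class, subclass letter, main group "/" subgroup.
_CPC_PATTERN = re.compile(r"^([A-HY])(\d{2})([A-Z])\s*(\d{1,4})/(\d{2,6})$")


def split_cpc(code: str) -> tuple[str, str]:
    """Split a CPC symbol into (technology category, subcategory)."""
    match = _CPC_PATTERN.match(code.strip())
    if match is None:
        raise ValueError(f"not a recognized CPC symbol: {code!r}")
    section, cls, subclass, group, subgroup = match.groups()
    return f"{section}{cls}{subclass}", f"{group}/{subgroup}"


def taxonomy_path(code: str) -> list[str]:
    """Labels from the broadest tier (section) to the most specific tier."""
    category, subcategory = split_cpc(code)
    main_group = subcategory.split("/")[0]
    return [
        category[0],                    # section, e.g. "H"
        category[:3],                   # class, e.g. "H04"
        category,                       # subclass, e.g. "H04L"
        f"{category} {main_group}",     # main group, e.g. "H04L 9"
        f"{category} {subcategory}",    # subgroup, e.g. "H04L 9/32"
    ]
```

Under this sketch, the tier of a classification model could be taken as the position in the path of the most specific label that all of the model's in class documents share.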

Additionally, or alternatively, the process 2700 may include determining that a first classification model of the classification models is associated with a first node of the taxonomy. The process 2700 may also include determining that a second classification model of the classification models is associated with a second node of the taxonomy. The process 2700 may also include determining that the first node and the second node are linked in the taxonomy and generating an indication that the first classification model is related to the second classification model based at least in part on the first node and the second node being linked in the taxonomy.
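A taxonomy of nodes and linkages can be sketched as a tree keyed by category labels. The sketch assumes each model is described by a path of labels from broad to specific (for example, CPC section, class, subclass) and treats two models as linked when one model's node is the direct parent of the other's. The class `TaxonomyNode` and the functions are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class TaxonomyNode:
    label: str
    models: list[str] = field(default_factory=list)
    children: dict[str, "TaxonomyNode"] = field(default_factory=dict)


def build_taxonomy(model_paths: dict[str, list[str]]) -> TaxonomyNode:
    """Insert each model at the node reached by following its label path."""
    root = TaxonomyNode("root")
    for model_name, path in model_paths.items():
        node = root
        for label in path:
            node = node.children.setdefault(label, TaxonomyNode(label))
        node.models.append(model_name)
    return root


def are_linked(model_paths: dict[str, list[str]],
               first: str, second: str) -> bool:
    """True when one model's node is the direct parent of the other's."""
    shorter, longer = sorted(
        (model_paths[first], model_paths[second]), key=len)
    return (len(longer) == len(shorter) + 1
            and longer[:len(shorter)] == shorter)
```

Other linkage rules, such as sibling nodes or feature similarity, would fit the description equally well; the parent-child rule is simply the easiest to show.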

Additionally, or alternatively, the process 2700 may include receiving, for a classification model of the classification models, a first training dataset configured to train the classification model to determine which of the documents are in class. The process 2700 may also include receiving an indication that a portion of the first training dataset includes confidential information. The process 2700 may also include generating a modified classification model corresponding to the classification model trained without the portion of the first training dataset that includes the confidential information. In these examples, generating the taxonomy may include utilizing the modified classification model instead of the classification model.

FIG. 28 illustrates a flow diagram of another example process 2800 for utilizing model taxonomies. The order in which the operations or steps are described is not intended to be construed as a limitation, and any number of the described operations may be combined in any order and/or in parallel to implement process 2800. The operations described with respect to the process 2800 are described as being performed by a client device, and/or a system associated with the document analysis platform. However, it should be understood that some or all of these operations may be performed by some or all of the components, devices, and/or systems described herein.

At block 2802, the process 2800 may include generating models configured to identify a document as in class or out of class, the models trained utilizing user input data indicating a first portion of documents as in class and a second portion of documents as out of class. For example, the system may utilize that user input data to train a classification model such that the classification model is configured to determine whether a given document is more similar to those documents marked in class or more similar to those documents marked out of class. To train the classification models utilizing this user input data, the document analysis platform may perform one or more operations. In some examples, the platform may generate a positive training dataset indicating in class keywords associated with the documents marked in class by a user. For example, the platform may determine one or more keywords associated with a given document that represent the subject matter of that document. This may be performed utilizing one or more document processing techniques, such as term frequency inverse document frequency techniques, for example. The platform may also generate a negative training dataset indicating keywords from the documents marked out of class by the user input. Each of these training datasets may then be utilized to train the classification model such that the classification model is configured to determine whether a given document has keywords that are more similar to the in class keywords than to the out of class keywords. In other examples, instead of or in addition to generating training datasets based on keywords, the platform may determine a vector for a given document. The vector may be associated with a coordinate system and may represent the subject matter of the document in the form of a vector. Vectors may be generated for the documents labeled in class and for the documents labeled out of class. The classification model may be trained to determine whether a vector representation of a given document is closer to the in class vectors than to the out of class vectors in the coordinate system. Techniques to generate vectors representing documents may include vectorization techniques such as Doc2Vec, or other similar techniques.

In addition to the techniques for training the classification models described above, the classification models may also be trained and/or organized based at least in part on classifications of the documents. For example, when the documents are patents and patent applications, a predetermined classification system may be established for classifying the subject matter of a given document. The classification system may be determined by the platform, by one or more users, and/or by a third party. For example, patents and patent applications may be associated with a predefined classification system such as the Cooperative Patent Classification (CPC) system. The CPC system employs CPC codes that correspond to differing subject matter, as described in more detail herein. The CPC codes for a given document may be identified and the categories associated with those codes may be determined. A user interface may be presented to the user that presents the determined categories and allows a user to select which categories the user finds in class for a given purpose. The selected categories may be utilized as a feature for training the classification models. Additionally, or alternatively, the platform may determine the CPC codes for documents marked as in class and may train the classification models to compare those CPC codes with the CPC codes associated with the documents to be analyzed to determine classification.

At block 2804, the process 2800 may include determining, for individual ones of the models, a category associated with individual ones of the models, the category associated with a classification system associated with the documents. For example, when the documents are patents and patent applications, a predetermined classification system may be established for classifying the subject matter of a given document. The classification system may be determined by the platform, by one or more users, and/or by a third party. For example, patents and patent applications may be associated with a predefined classification system such as the Cooperative Patent Classification (CPC) system. The CPC system employs CPC codes that correspond to differing subject matter. The CPC codes for a given document may be identified and the categories associated with those codes may be determined.

At block 2806, the process 2800 may include generating a taxonomy of the models, the taxonomy indicating categorical relationships between the models, wherein generating the taxonomy is based at least in part on the category associated with the individual ones of the models. For example, the taxonomy may include nodes representing the models and linkages between the nodes representing relationships between the models.

Additionally, or alternatively, the process 2800 may include determining a classifier associated with individual ones of the documents, the classifier based at least in part on a classification system that utilizes the documents. The process 2800 may also include determining the category based at least in part on the classifier. The process 2800 may also include determining a tier of the taxonomy to associate the individual ones of the classification models based at least in part on the category.

Additionally, or alternatively, the process 2800 may include determining that a first model of the models is associated with a first node of the taxonomy. The process 2800 may also include determining that a second model of the models is associated with a second node of the taxonomy. The process 2800 may also include determining that the first node and the second node are linked in the taxonomy. The process 2800 may also include generating an indication that the first model is related to the second model based at least in part on the first node and the second node being linked in the taxonomy.

Additionally, or alternatively, the process 2800 may include receiving, for a model of the models, a training dataset configured to train the model to determine which of the documents are in class for the model. The process 2800 may also include receiving an indication that a portion of the training dataset includes confidential information. The process 2800 may also include generating a modified model corresponding to the model trained without the portion of the training dataset that includes the confidential information. In these examples, generating the taxonomy may include generating the taxonomy based at least in part on the modified model.
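Generating the modified model can be sketched as filtering the flagged portion out of the training dataset before training again. The per-example `confidential` flag, the dictionary layout, and the `train` callable are all assumptions for illustration; the description does not specify how the confidential portion is indicated.

```python
from typing import Any, Callable


def train_without_confidential(
    training_set: list[dict[str, Any]],
    train: Callable[[list[dict[str, Any]]], Any],
) -> Any:
    """Train a modified model on only the non-confidential examples.

    `train` is a hypothetical callable that maps examples to a model."""
    shareable = [
        example for example in training_set
        if not example.get("confidential", False)
    ]
    return train(shareable)
```

The original model can remain available to the user who supplied the confidential data, while the modified model is the one placed in the shared taxonomy.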

Additionally, or alternatively, the process 2800 may include receiving data indicating a restriction associated with the user input data for a model of the models and determining that the restriction is user specific. The process 2800 may also include determining, based at least in part on the restriction being user specific, a portion of a training dataset configured to train the model that is associated with the restriction. The process 2800 may also include generating a modified model corresponding to the model trained without the portion of the training dataset.

Additionally, or alternatively, the process 2800 may include receiving request data for use of at least one of the classification models for determining classification of a set of documents, the request data indicating keywords associated with the set of documents. The process 2800 may also include determining a category of multiple categories associated with the keywords and selecting a model of the models based at least in part on the category. The process 2800 may also include determining a portion of the set of documents that are in class utilizing the model.

Additionally, or alternatively, the process 2800 may include determining, based at least in part on classifiers associated with a classification system, that the models are not associated with a classifier of the classifiers. The process 2800 may also include determining a technology category associated with the classifier. The process 2800 may also include generating an indication that the models do not include a model associated with the technology category.

Additionally, or alternatively, the process 2800 may include determining first features representing a first model of the models and determining second features representing a second model of the models. The process 2800 may also include determining a similarity value of the first features to the second features. The process 2800 may also include associating the first model with the second model in the taxonomy based at least in part on the similarity value.
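As a sketch of the similarity value, the features of each model can be treated as a set of keywords and compared with the Jaccard index. The choice of Jaccard similarity, the 0.3 threshold, and the function names are assumptions; any similarity measure over the models' features would fit the description.

```python
from itertools import combinations


def jaccard(first: set[str], second: set[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    if not first and not second:
        return 0.0
    return len(first & second) / len(first | second)


def link_similar_models(
    model_keywords: dict[str, set[str]],
    threshold: float = 0.3,
) -> list[tuple[str, str, float]]:
    """Return (model, model, similarity) for pairs meeting the threshold."""
    links = []
    for (name_a, kw_a), (name_b, kw_b) in combinations(
            model_keywords.items(), 2):
        similarity = jaccard(kw_a, kw_b)
        if similarity >= threshold:
            links.append((name_a, name_b, similarity))
    return links
```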

FIG. 29 illustrates a flow diagram of an example process 2900 for searching classification models using taxonomies. The order in which the operations or steps are described is not intended to be construed as a limitation, and any number of the described operations may be combined in any order and/or in parallel to implement process 2900. The operations described with respect to the process 2900 are described as being performed by a client device, and/or a system associated with the document analysis platform. However, it should be understood that some or all of these operations may be performed by some or all of the components, devices, and/or systems described herein.

At block 2902, the process 2900 may include storing a taxonomy of classification models, the classification models each configured to receive documents and determine a classification of individual ones of the documents, each of the classification models trained based at least in part on a document dataset indicated to be in class from user input data, the documents comprising patents and patent applications. For example, once classification models are trained such that the models are determined to accurately predict classification as trained, the models may be placed in a model taxonomy. The model taxonomy may represent a taxonomy tree or otherwise a model hierarchy indicating relationships between models and/or a level of specificity associated with the models. This taxonomy may be searchable and may provide functionality that allows a user to provide a search query for a model. The keywords from the search query may be utilized to identify models that may be applicable to the search query and/or to highlight “branches” of the taxonomy associated with the search query.

The models in the model taxonomy may be linked to each other in one or more ways. For example, when the subject matter of one model is related to the subject matter of another model, those models may be linked in the taxonomy. In some examples, the nodes of the taxonomy representing the models may be determined utilizing a predefined subject matter classification system, such as the CPC system described herein.

At block 2904, the process 2900 may include generating a user interface configured to accept user input representing a search query, the search query including keywords from the user input. For example, the user interface may have one or more input fields configured to receive user input, such as text and/or audio representing the keywords to be searched for.

At block 2906, the process 2900 may include determining, utilizing the keywords, a portion of the classification models that is associated with the search query. For example, in examples where the models are represented by keywords, those reference keywords may be compared to the keywords from the search query to determine a correlation between the search query keywords and the reference keywords. Models with a high correlation and/or a best correlation as compared to other models may be identified as responsive to the search query. Additionally, models that are related to the highly-correlative models may, in examples, also be determined.
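The comparison of search query keywords to reference keywords may be sketched as below. The correlation is assumed to be the fraction of query terms found among a model's reference keywords, and models with no overlap are dropped; the function name `search_models` and this scoring rule are illustrative assumptions.

```python
def search_models(
    query: str,
    reference_keywords: dict[str, set[str]],
    top_n: int = 3,
) -> list[tuple[str, float]]:
    """Rank models by the share of query terms in their reference keywords."""
    query_terms = set(query.lower().split())
    if not query_terms:
        return []
    scored = []
    for name, keywords in reference_keywords.items():
        overlap = len(query_terms & keywords)
        if overlap:
            scored.append((name, overlap / len(query_terms)))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_n]
```

The first entry of the result would correspond to the model most related to the query, and the remaining entries to other responsive models.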

At block 2908, the process 2900 may include causing display, via the user interface, of search results for the search query, the search results including an indication of a portion of the taxonomy associated with the portion of the classification models, the search results also indicating a classification model of the classification models determined to be most related to the search query. For example, the user interface may be utilized to display indications of the models identified during a model search, and the user interface may be configured to receive user input indicating selection of a given model for use in determining classification of documents. The user and/or the platform may then upload the document set to be analyzed and the selected model may be utilized to predict classification of individual ones of the documents. A user interface indicating the classification predictions as performed utilizing the selected model may be displayed as well as a confidence value associated with the accuracy of the model in determining classification. This may provide the user with an indication of whether the selected model is sufficient for analyzing the documents at issue, or whether another model should be selected, or a new model should be trained.

Additionally, or alternatively, the process 2900 may include storing, in association with individual ones of the classification models, reference keywords representing how the individual ones of the classification models have been trained to determine classification. The process 2900 may also include determining a similarity value between the keywords associated with the search query and the reference keywords. The process 2900 may also include determining a first classification model that is most related to the search query based at least in part on the similarity value. The process 2900 may also include determining a first tier of the taxonomy associated with the first classification model. In these examples, the search results include: a second classification model associated with a second tier of the taxonomy, the second tier indicating a broader technological category than the first tier; and a third classification model associated with a third tier of the taxonomy, the third tier indicating a more specific technological category than the first tier.
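The tiered search results described above can be sketched with a small tree. This is an illustration under stated assumptions: the `TaxonomyNode` class, the `tiered_results` helper, and the category names are hypothetical, and tiers are assumed to be tree depths with the broader tier being the parent and the more specific tier being the children.

```python
class TaxonomyNode:
    """One model in a hypothetical taxonomy tree."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    @property
    def tier(self):
        """Depth in the tree; 0 is the broadest category."""
        depth, node = 0, self
        while node.parent is not None:
            depth, node = depth + 1, node.parent
        return depth


def tiered_results(best_match):
    """Bundle the best match with its broader and more specific neighbors."""
    return {
        "best": best_match.name,
        "broader": best_match.parent.name if best_match.parent else None,
        "narrower": [child.name for child in best_match.children],
    }


# Hypothetical three-tier taxonomy.
energy_storage = TaxonomyNode("energy_storage")
batteries = TaxonomyNode("batteries", parent=energy_storage)
lithium_ion = TaxonomyNode("lithium_ion", parent=batteries)
solid_state = TaxonomyNode("solid_state", parent=batteries)

results = tiered_results(batteries)
```

Here `batteries` stands in for the model most related to the query, `energy_storage` for the model on the broader tier, and `lithium_ion` and `solid_state` for models on the more specific tier.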

Additionally, or alternatively, the process 2900 may include receiving, via the user interface, user input data indicating selection of the classification model. The process 2900 may include causing display of a request for sample documents for input into the classification model. The process 2900 may include receiving document data corresponding to the sample documents. The process 2900 may also include determining, utilizing the classification model, a first portion of the sample documents determined to be in class. The process 2900 may also include determining, utilizing the classification model, a second portion of the sample documents determined to be out of class. The process 2900 may also include causing display, via the user interface, of: a first indication of the first portion of the sample documents; a second indication of the second portion of the sample documents; and a third indication of a confidence value that the classification model accurately determined the first portion and the second portion.
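A minimal sketch of splitting sample documents into two portions and reporting a confidence value follows. It assumes the two portions are the in-class and out-of-class documents, that the selected model exposes a per-document probability of being in class, and that the confidence value is the mean certainty of the individual predictions. None of these details are specified in the description, and the names and scores are hypothetical.

```python
def partition_documents(documents, score_fn, threshold=0.5):
    """Split documents into in-class and out-of-class portions.

    score_fn is assumed to return a probability in [0, 1] that a document
    is in class. The confidence value is assumed to be the mean of
    max(p, 1 - p) over all documents.
    """
    in_class, out_of_class, certainties = [], [], []
    for doc in documents:
        p = score_fn(doc)
        (in_class if p >= threshold else out_of_class).append(doc)
        certainties.append(max(p, 1.0 - p))
    confidence = sum(certainties) / len(certainties) if certainties else 0.0
    return in_class, out_of_class, confidence


# Hypothetical model output for three sample documents.
SCORES = {"doc_a": 0.9, "doc_b": 0.2, "doc_c": 0.7}

in_class, out_of_class, confidence = partition_documents(
    list(SCORES), SCORES.get
)
```

A low confidence value in such a scheme would suggest that another model should be selected or a new model trained.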

Additionally, or alternatively, the process 2900 may include receiving, via the user interface, user input data indicating selection of the classification model. The process 2900 may also include causing display of a request for sample documents for input into the classification model. The process 2900 may also include receiving document data corresponding to the sample documents. The process 2900 may also include determining, utilizing output of the classification model indicating classification of individual ones of the sample documents, a ranking of the sample documents. The process 2900 may also include causing display of at least an indication of the sample documents in an order corresponding to the ranking.
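The ranking step can be sketched as a sort over model output. The description does not state what the ranking is based on; this example assumes a per-document in-class score, and the document identifiers and scores are hypothetical.

```python
def rank_documents(scores):
    """Order document identifiers from highest to lowest model score."""
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical model output: document identifier -> in-class score.
model_output = {"US-A": 0.42, "US-B": 0.91, "US-C": 0.67}

ranking = rank_documents(model_output)
```

The display order would then follow `ranking`, with the most strongly in-class documents shown first.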

FIG. 30 illustrates a flow diagram of another example process 3000 for searching classification models using taxonomies. The order in which the operations or steps are described is not intended to be construed as a limitation, and any number of the described operations may be combined in any order and/or in parallel to implement process 3000. The operations described with respect to the process 3000 are described as being performed by a client device, and/or a system associated with the document analysis platform. However, it should be understood that some or all of these operations may be performed by some or all of the components, devices, and/or systems described herein.

At block 3002, the process 3000 may include storing a taxonomy of models trained to determine a classification of individual ones of documents, individual ones of the models trained based at least in part on a document dataset indicated to be in class at least in part from first user input data. For example, once classification models are trained such that the models are determined to accurately predict classification as trained, the models may be placed in a model taxonomy. The model taxonomy may represent a taxonomy tree or otherwise a model hierarchy indicating relationships between models and/or a level of specificity associated with the models. This taxonomy may be searchable and may provide functionality that allows a user to provide a search query for a model. The keywords from the search query may be utilized to identify models that may be applicable to the search query and/or to highlight “branches” of the taxonomy associated with the search query.

The models in the model taxonomy may be linked to each other in one or more ways. For example, when the subject matter of one model is related to the subject matter of another model, those models may be linked in the taxonomy. In some examples, the nodes of the taxonomy representing the models may be determined utilizing a predefined subject matter classification system, such as the CPC system described herein.

At block 3004, the process 3000 may include receiving second user input data representing a search query. For example, a user interface may have one or more input fields configured to receive user input, such as text and/or audio representing the keywords to be searched for.

At block 3006, the process 3000 may include determining a portion of the models that are associated with the search query. For example, in examples where the models are represented by keywords, those reference keywords may be compared to the keywords from the search query to determine a correlation between the search query keywords and the reference keywords. For those models with a high correlation and/or a best correlation as compared to other models, those models may be identified as responsive to the search query. Additionally, models that are related to the highly-correlative models may, in examples, also be determined.

At block 3008, the process 3000 may include causing display of search results for the search query, the search results indicating a portion of the taxonomy associated with the portion of the models. For example, the user interface may be utilized to display indications of the models identified during a model search, and the user interface may be configured to receive user input indicating selection of a given model for use in determining classification of documents. The user and/or the platform may then upload the document set to be analyzed and the selected model may be utilized to predict classification of individual ones of the documents. A user interface indicating the classification predictions as performed utilizing the selected model may be displayed as well as a confidence value associated with the accuracy of the model in determining classification. This may provide the user with an indication of whether the selected model is sufficient for analyzing the documents at issue, or whether another model should be selected, or a new model should be trained.

Additionally, or alternatively, the process 3000 may include storing, in association with the individual ones of the models, a reference representation of the individual ones of the models. The process 3000 may also include determining a similarity value between a sample representation of the search query and the reference representation. The process 3000 may also include determining a first model that is most related to the search query based at least in part on the similarity value. The process 3000 may also include determining a first tier of the taxonomy associated with the first model. In these examples, the search results include the first model and at least one of: a second model associated with a second tier of the taxonomy, the second tier indicating a broader technological category than the first tier; or a third model associated with a third tier of the taxonomy, the third tier indicating a more specific technological category than the first tier.

Additionally, or alternatively, the process 3000 may include receiving user input data indicating selection of a model of the portion of the models. The process 3000 may also include causing display of a request for sample documents for input into the model. The process 3000 may also include receiving document data corresponding to the sample documents. The process 3000 may also include determining, utilizing the model, a first portion of the sample documents determined to be in class and determining, utilizing the model, a second portion of the sample documents determined to be out of class. The process 3000 may also include causing display of: a first indication of the first portion of the sample documents; a second indication of the second portion of the sample documents; and a third indication of a confidence value that the model accurately determined the first portion and the second portion.

Additionally, or alternatively, the process 3000 may include receiving user input data indicating selection of the model. The process 3000 may also include causing display of a request for sample documents for input into the model and receiving document data corresponding to the sample documents. The process 3000 may also include determining, utilizing output of the model indicating classification of individual ones of the sample documents, a ranking of the sample documents. The process 3000 may also include causing display of at least an indication of the sample documents in an order corresponding to the ranking.

Additionally, or alternatively, the process 3000 may include storing, in association with the individual ones of the models, a reference purpose of the individual ones of the models, the reference purpose indicated by user input associated with training the individual ones of the models. The process 3000 may also include determining a similarity value between a sample purpose indicated in the search query and the reference purpose. The process 3000 may also include determining a model that is most related to the search query based at least in part on the similarity value. In these examples, the search results may indicate the model.

Additionally, or alternatively, the process 3000 may include determining that a confidence value indicating a similarity between the portion of the models and the search query does not satisfy a threshold confidence value. In these examples, causing display of the search results may include causing display, based at least in part on the confidence value not satisfying the threshold confidence value, of an option to train a model not yet in the taxonomy.
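The threshold check can be sketched as follows. This is an illustration only: it assumes the confidence value is the best similarity score among the matched models and that "satisfying" the threshold means meeting or exceeding it. The function name, response keys, and threshold value are hypothetical.

```python
def build_search_response(scored_models, threshold=0.5):
    """Package search results and flag whether to offer training a new model.

    scored_models is a list of (model_name, similarity) pairs. The option to
    train a model is offered when even the best match falls below threshold.
    """
    best_score = max((score for _, score in scored_models), default=0.0)
    return {
        "results": sorted(scored_models, key=lambda pair: pair[1], reverse=True),
        "offer_training": best_score < threshold,
    }


weak = build_search_response([("model_a", 0.3), ("model_b", 0.1)])
strong = build_search_response([("model_a", 0.8)])
```

With these inputs, `weak["offer_training"]` is true and `strong["offer_training"]` is false, so only the weak search would display the training option alongside its results.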

Additionally, or alternatively, the process 3000 may include receiving user input data indicating a classification of sample documents. The process 3000 may also include training the model based at least in part on the user input data. The process 3000 may also include determining, from the user input data, a technological category associated with the model. The process 3000 may also include causing the model to be included in the taxonomy based at least in part on the technological category.
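Placing a newly trained model into the taxonomy by technological category might look like the sketch below. It assumes the category has already been determined from the user input data and is available as a path from broad to specific. The nested-dictionary layout, the function name, and the category and model names are hypothetical.

```python
def add_model_to_taxonomy(taxonomy, category_path, model_name):
    """Insert model_name under the node reached by following category_path.

    Missing intermediate categories are created on the way down.
    """
    node = taxonomy
    for category in category_path:
        node = node.setdefault("children", {}).setdefault(category, {})
    node.setdefault("models", []).append(model_name)
    return taxonomy


taxonomy = {}
add_model_to_taxonomy(taxonomy, ["energy_storage", "batteries"], "solid_state_model")
add_model_to_taxonomy(taxonomy, ["energy_storage"], "grid_storage_model")
```

After these two calls the taxonomy holds one model at the broader `energy_storage` node and one at the more specific `batteries` node beneath it.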

Additionally, or alternatively, the process 3000 may include determining a first model that is most related to the search query. The process 3000 may also include determining a tier of the taxonomy associated with the first model. The process 3000 may also include determining a second model that is associated with the tier. In these examples, the search results may include the first model and the second model.

While the foregoing invention is described with respect to the specific examples, it is to be understood that the scope of the invention is not limited to these specific examples. Since other modifications and changes varied to fit particular operating requirements and environments will be apparent to those skilled in the art, the invention is not considered limited to the example chosen for purposes of disclosure, and covers all changes and modifications which do not constitute departures from the true spirit and scope of this invention.

Although the application describes embodiments having specific structural features and/or methodological acts, it is to be understood that the claims are not necessarily limited to the specific features or acts described. Rather, the specific features and acts are merely illustrative of some embodiments that fall within the scope of the claims.

1. (canceled)
 2. A method comprising: receiving documents that include at least one of patents or patent applications; generating first data representing the documents, the first data distinguishing components of the documents; generating a user interface configured to display: the components of individual ones of the documents; and an element configured to accept user input indicating whether the individual ones of the documents are in class or out of class; generating a classification model based at least in part on user input data corresponding to the user input, the classification model trained utilizing at least a first portion of the documents indicated to be in class by the user input data; and causing the user interface to display an indication of: the first portion of the documents; a second portion of the documents marked as out of class in response to the user input; a third portion of the documents determined to be in class utilizing the classification model; and a fourth portion of the documents determined to be out of class utilizing the classification model.
 3. The method of claim 2, wherein the user input comprises first user input, the user input data comprises first user input data, and the method further comprises: determining a first confidence value associated with results of the classification model; receiving second user input indicating classification of at least one document determined to be in class utilizing the classification model; causing the classification model to be retrained based at least in part on second user input data corresponding to the second user input; determining a second confidence value associated with results of the classification model as retrained; and causing display of a trendline representing a change from the first confidence value to the second confidence value.
 4. The method of claim 2, wherein the user input comprises first user input, the user input data comprises first user input data, and the method further comprises: receiving second user input indicating classification of at least one document determined to be in class utilizing the classification model; causing the classification model to be retrained based at least in part on second user input data corresponding to the second user input; determining a change in a number of the third portion of the documents marked in class utilizing the classification model as retrained; and causing display of an influence value of the second user input on output by the classification model, the influence value indicating a likelihood that additional user input will impact performance of the classification model.
 5. The method of claim 2, further comprising: generating second data indicating a relationship between a first document of the documents and a second document of the documents, the relationship indicating that the first document includes at least one component that is similar to a component of the second document; determining that the user input data indicates that the first document is in class; and determining that the second document is in class based at least in part on the second data indicating the relationship.
 6. A system, comprising: one or more processors; and non-transitory computer-readable media storing computer-executable instructions that, when executed by the one or more processors, cause the one or more processors to perform operations comprising: generating first data representing documents received from one or more databases; causing display of an element configured to accept user input indicating whether individual ones of the documents are in class or out of class; generating a model based at least in part on user input data corresponding to the user input; determining, utilizing the model, a first portion of the documents that are in class and a second portion of the documents that are out of class; and causing display of an indication of the first portion of the documents and the second portion of the documents.
 7. The system of claim 6, wherein the user input comprises first user input, the user input data comprises first user input data, and the operations further comprise: determining a first confidence value associated with results of the model; receiving second user input indicating classification of at least one document determined to be in class utilizing the model; causing the model to be retrained based at least in part on second user input data corresponding to the second user input; determining a second confidence value associated with results of the model as retrained; and causing display of a trendline representing a change from the first confidence value to the second confidence value.
 8. The system of claim 6, wherein the user input comprises first user input, the user input data comprises first user input data, and the operations further comprise: receiving second user input indicating classification of at least one document determined to be in class utilizing the model; causing the model to be retrained based at least in part on second user input data corresponding to the second user input; determining a change in a number of at least one of the first portion of the documents or the second portion of the documents utilizing the model as retrained; and causing display of an influence value of the second user input on output by the model.
 9. The system of claim 6, the operations further comprising: generating second data indicating a relationship between a first document of the documents and a second document of the documents; determining that the user input data indicates that the first document is in class; and determining that the second document is in class based at least in part on the second data indicating the relationship.
 10. The system of claim 6, the operations further comprising: determining, for individual ones of the documents marked as in class, a confidence value; and determining a ranking of the individual ones of the documents marked as in class based at least in part on the confidence value.
 11. The system of claim 6, the operations further comprising causing display of: a first indication of a first number of the documents marked in class; a second indication of a second number of the documents marked out of class; a third indication of a third number of the documents determined to be in class utilizing the model; and a fourth indication of a fourth number of the documents determined to be out of class utilizing the model.
 12. The system of claim 6, the operations further comprising: causing display of: a first section indicating first keywords determined to be statistically relevant by the model for identifying the first portion of the documents; and a second section indicating second keywords determined to be statistically relevant by the model for identifying the second portion of the documents; and based at least in part on receiving the user input indicating that at least one of the first keywords or the second keywords should be removed, retraining the model to account for removal of the at least one of the first keywords or the second keywords.
 13. The system of claim 6, the operations further comprising: searching, utilizing the model, one or more databases for additional documents determined to be in class by the model; receiving an instance of at least one additional document from the one or more databases; receiving user input indicating classification of the at least one additional document; and retraining the model based at least in part on the user input indicating the classification of the at least one additional document.
 14. A method, comprising: generating first data representing documents received from one or more databases; causing display of an element configured to accept user input indicating whether individual ones of the documents are in class or out of class; generating a model based at least in part on user input data corresponding to the user input; and determining, based at least in part on the model and the user input data, a first portion of the documents that are in class and a second portion of the documents that are out of class.
 15. The method of claim 14, wherein the user input comprises first user input, the user input data comprises first user input data, and the method further comprises: determining a first confidence value associated with results of the model; receiving second user input indicating classification of at least one document determined to be in class utilizing the model; causing the model to be retrained based at least in part on second user input data corresponding to the second user input; determining a second confidence value associated with results of the model as retrained; and causing display of a trendline representing a change from the first confidence value to the second confidence value.
 16. The method of claim 14, wherein the user input comprises first user input, the user input data comprises first user input data, and the method further comprises: receiving second user input indicating classification of at least one document determined to be in class utilizing the model; causing the model to be retrained based at least in part on second user input data corresponding to the second user input; determining a change in a number of the second portion of the documents determined to be in class utilizing the model as retrained; and causing display of an influence value of the second user input on output by the model.
 17. The method of claim 14, further comprising: generating second data indicating a relationship between a first document of the documents and a second document of the documents; determining that the user input data indicates that the first document is in class; and determining that the second document is in class based at least in part on the second data indicating the relationship.
 18. The method of claim 14, further comprising: determining, for individual ones of the documents marked as in class, a confidence value indicating a degree of confidence that the individual ones of the documents were marked correctly as in class; and determining a ranking of the individual ones of the documents marked as in class based at least in part on the confidence value.
 19. The method of claim 14, further comprising causing display of: a first indication of a first number of the documents marked in class in response to user input; a second indication of a second number of the documents marked out of class in response to the user input; a third indication of a third number of the documents determined to be in class utilizing the model; and a fourth indication of a fourth number of the documents determined to be out of class utilizing the model.
 20. The method of claim 14, further comprising: causing display of: a first section indicating first keywords determined to be statistically relevant by the model for identifying the first portion of the documents; and a second section indicating second keywords determined to be statistically relevant by the model for identifying the second portion of the documents; and based at least in part on receiving the user input indicating that at least one of the first keywords or the second keywords should be removed, retraining the model to account for removal of the at least one of the first keywords or the second keywords.
 21. The method of claim 14, further comprising: searching, utilizing the model, one or more databases for additional documents determined to be in class by the model; receiving an instance of the additional documents from the one or more databases; receiving user input indicating classification of the additional documents; and retraining the model based at least in part on the user input indicating the classification of the additional documents.